DS1000: by models


SE predicted by accuracy

Typical standard errors for comparisons between pairs of models on this dataset, as a function of absolute accuracy.

CDF of question-level accuracy

Results table by model

model	pass1 (%)	win_rate	count	SE(A)	SE_x(A)	SE_pred(A)
claude-3-5-sonnet-20240620	54.3	32.2	1	1.6	NaN	NaN
gpt-4-turbo-2024-04-09	54	31.1	1	1.6	NaN	NaN
deepseek-ai-deepseek-coder-V2-SFT	53.2	30.5	1	1.6	NaN	NaN
Qwen-Qwen2-72B-Instruct	52.8	30.2	1	1.6	NaN	NaN
mistralai-Codestral-22B-v0.1	51.2	28.7	1	1.6	NaN	NaN
gpt-4-0613	51	29.1	1	1.6	NaN	NaN
meta-llama-Llama-3-70b-chat-hf	48.6	27.6	1	1.6	NaN	NaN
deepseek-ai-deepseek-coder-V2-Base	46.7	25.3	1	1.6	NaN	NaN
microsoft-wavecoder-ultra-6.7b	46	25	1	1.6	NaN	NaN
deepseek-ai-deepseek-coder-33b-instruct	45.4	25.2	1	1.6	NaN	NaN
m-a-p-OpenCodeInterpreter-DS-6.7B	42	22.1	1	1.6	NaN	NaN
deepseek-ai-deepseek-coder-33b-base	41.7	20.9	1	1.6	NaN	NaN
meta-llama-Llama-3-70B	40.9	21	1	1.6	NaN	NaN
deepseek-ai-deepseek-llm-67b-chat	40.7	21	1	1.6	NaN	NaN
microsoft-Phi-3-medium-4k-instruct	40.6	20.3	1	1.6	NaN	NaN
Phind-Phind-CodeLlama-34B-v2	40.4	21	1	1.6	NaN	NaN
Qwen-Qwen1.5-110B-Chat	40.2	20.1	1	1.6	NaN	NaN
mistralai-Mixtral-8x22B	40	19.8	1	1.5	NaN	NaN
codellama-CodeLlama-70b-hf	39.8	20.1	1	1.5	NaN	NaN
m-a-p-OpenCodeInterpreter-CL-7B	39.5	21	1	1.5	NaN	NaN
gpt-3.5-turbo-0125	39.4	20.8	1	1.5	NaN	NaN
m-a-p-OpenCodeInterpreter-SC2-7B	38.9	22.1	1	1.5	NaN	NaN
codellama-CodeLlama-34b-Python-hf	38.9	19.4	1	1.5	NaN	NaN
codellama-CodeLlama-70b-Python-hf	38.9	19.5	1	1.5	NaN	NaN
gpt-3.5-turbo-0613	38.6	19.8	1	1.5	NaN	NaN
codex002	38.6	18.5	1	1.5	NaN	NaN
m-a-p-OpenCodeInterpreter-SC2-3B	38.6	20.5	1	1.5	NaN	NaN
deepseek-ai-deepseek-V2-chat	38.5	19.7	1	1.5	NaN	NaN
microsoft-Phi-3-small-8k-instruct	37.7	18.5	1	1.5	NaN	NaN
bigcode-starcoder2-15b	37	17.9	1	1.5	NaN	NaN
WizardLM-WizardCoder-Python-34B-V1.0	36.7	18.2	1	1.5	NaN	NaN
Qwen-Qwen1.5-72B-Chat	35.5	16.6	1	1.5	NaN	NaN
google-codegemma-7b	34.8	15.8	1	1.5	NaN	NaN
ibm-granite-granite-34b-code-base	34.8	16.4	1	1.5	NaN	NaN
codellama-CodeLlama-34b-hf	34.6	15.9	1	1.5	NaN	NaN
Qwen-Qwen1.5-72B	34.3	16	1	1.5	NaN	NaN
deepseek-ai-deepseek-coder-7b-base-v1.5	34.2	15.4	1	1.5	NaN	NaN
ibm-granite-granite-8b-code-base	33.8	15.5	1	1.5	NaN	NaN
Qwen-Qwen1.5-32B-Chat	32.8	15.3	1	1.5	NaN	NaN
microsoft-wavecoder-ds-6.7b	32.8	15.5	1	1.5	NaN	NaN
microsoft-Phi-3-mini-4k-instruct	32.1	15	1	1.5	NaN	NaN
meta-llama-Llama-3-8B	31.5	14.3	1	1.5	NaN	NaN
bigcode-starcoder2-7b	31.4	13.7	1	1.5	NaN	NaN
microsoft-Phi-3-mini-128k-instruct	31.3	14.7	1	1.5	NaN	NaN
microsoft-wavecoder-pro-6.7b	31.2	14.8	1	1.5	NaN	NaN
deepseek-ai-deepseek-coder-6.7b-base	31.1	13.4	1	1.5	NaN	NaN
Qwen-Qwen2-7B	31	13.2	1	1.5	NaN	NaN
codellama-CodeLlama-13b-Python-hf	31	13.5	1	1.5	NaN	NaN
deepseek-ai-deepseek-coder-V2-Lite-Base	30.5	13.3	1	1.5	NaN	NaN
openchat-openchat-3.5-0106	30.3	13.9	1	1.5	NaN	NaN
ibm-granite-granite-20b-code-base	30	13.5	1	1.4	NaN	NaN
google-codegemma-1.1-7b-it	29.7	13.7	1	1.4	NaN	NaN
Doubao-pro-4k	29.1	14	1	1.4	NaN	NaN
mistralai-Mixtral-8x7B-v0.1	28.8	12.4	1	1.4	NaN	NaN
Qwen-Qwen1.5-32B	28.5	12.2	1	1.4	NaN	NaN
codellama-CodeLlama-13b-hf	27.8	11.8	1	1.4	NaN	NaN
Qwen-CodeQwen1.5-7B	27.6	12	1	1.4	NaN	NaN
bigcode-starcoder2-3b	27.3	11.5	1	1.4	NaN	NaN
google-codegemma-7b-it	26.2	11.6	1	1.4	NaN	NaN
google-gemma-7b	26.1	10.7	1	1.4	NaN	NaN
codellama-CodeLlama-7b-Python-hf	26	10.7	1	1.4	NaN	NaN
stabilityai-stable-code-3b	25.6	10.7	1	1.4	NaN	NaN
meta-llama-Llama-2-70b-hf	25.2	10.1	1	1.4	NaN	NaN
m-a-p-OpenCodeInterpreter-DS-1.3B	25	11.3	1	1.4	NaN	NaN
Qwen-Qwen1.5-14B	24.8	9.89	1	1.4	NaN	NaN
THUDM-codegeex2-6b	24.1	9.74	1	1.4	NaN	NaN
deepseek-ai-deepseek-coder-V2-Instruct	23.3	10.9	1	1.3	NaN	NaN
claude-3-sonnet-20240229	23.2	10.3	1	1.3	NaN	NaN
codellama-CodeLlama-7b-hf	22.9	8.85	1	1.3	NaN	NaN
ibm-granite-granite-3b-code-base	22.8	8.61	1	1.3	NaN	NaN
claude-3-opus-20240229	21.6	9.72	1	1.3	NaN	NaN
microsoft-phi-2	21.5	8.36	1	1.3	NaN	NaN
Qwen-Qwen1.5-14B-Chat	21.4	8.92	1	1.3	NaN	NaN
Qwen-Qwen1.5-7B	20.1	7.42	1	1.3	NaN	NaN
gpt-4o-2024-05-13	20.1	9.52	1	1.3	NaN	NaN
mistralai-Mixtral-8x22B-Instruct-v0.1	19.9	10.8	1	1.3	NaN	NaN
mistralai-Mistral-7B-v0.3	19.7	7.48	1	1.3	NaN	NaN
google-gemma-1.1-7b-it	18.3	7.47	1	1.2	NaN	NaN
meta-llama-Llama-3-8b-chat-hf	17.8	7.77	1	1.2	NaN	NaN
deepseek-ai-deepseek-coder-1.3b-base	17.5	6.11	1	1.2	NaN	NaN
deepseek-ai-deepseek-V2-Lite	16.9	6.1	1	1.2	NaN	NaN
google-codegemma-1.1-2b	16.6	6.08	1	1.2	NaN	NaN
claude-3-haiku-20240307	16.3	6.6	1	1.2	NaN	NaN
Doubao-lite-4k	15.7	6.11	1	1.2	NaN	NaN
Salesforce-codegen25-7b-mono_P	15.6	5.94	1	1.1	NaN	NaN
google-codegemma-2b	13.3	4.58	1	1.1	NaN	NaN
Qwen-Qwen2-1.5B	11.8	4.42	1	1	NaN	NaN
meta-llama-Llama-2-13b-hf	11.6	3.95	1	1	NaN	NaN
google-gemma-7b-it	11.4	3.96	1	1	NaN	NaN
google-gemma-2b	10.3	3.11	1	0.96	NaN	NaN
microsoft-phi-1	9.1	2.7	1	0.91	NaN	NaN
ERNIE-Speed-8K	8.8	2.96	1	0.9	NaN	NaN
codellama-CodeLlama-70b-Instruct-hf	8.7	3.28	1	0.89	NaN	NaN
google-gemma-1.1-2b-it	8.5	2.98	1	0.88	NaN	NaN
microsoft-phi-1_5	8.3	2.75	1	0.87	NaN	NaN
codellama-CodeLlama-13b-Instruct-hf	7.9	2.86	1	0.85	NaN	NaN
meta-llama-Llama-2-7b-hf	6.9	2.11	1	0.8	NaN	NaN
mistralai-Mistral-7B-Instruct-v0.3	6.9	1.82	1	0.8	NaN	NaN
meta-llama-Llama-2-7b-chat-hf	6.4	1.92	1	0.77	NaN	NaN
google-gemma-2b-it	6	1.89	1	0.75	NaN	NaN
smallcloudai-Refact-1_6B-fim	5.7	1.97	1	0.73	NaN	NaN
codellama-CodeLlama-34b-Instruct-hf	5.2	1.53	1	0.7	NaN	NaN
Qwen-Qwen2-0.5B	3.9	0.808	1	0.61	NaN	NaN
mistralai-Mixtral-8x7B-Instruct-v0.1	3.7	1.16	1	0.6	NaN	NaN
meta-llama-Llama-2-70b-chat-hf	3.7	1.2	1	0.6	NaN	NaN
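
The SE(A) values in the table appear consistent with a plain binomial standard error, sqrt(p(1 - p) / N), where p is pass1 as a fraction. This is an inference from the numbers, not something the page states. The sketch below checks that reading against a few rows, assuming N = 1000 questions; the function name `binomial_se_percent` and the constant `N_QUESTIONS` are hypothetical names introduced for illustration.

```python
import math

# Assumption: the benchmark has 1000 questions. The page does not state N.
N_QUESTIONS = 1000


def binomial_se_percent(pass1_percent, n=N_QUESTIONS):
    """Binomial standard error of an accuracy, in percentage points.

    pass1_percent is the accuracy in percent (as in the pass1 column).
    """
    p = pass1_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# A few (pass1, listed SE(A)) pairs copied from the table above.
rows = {
    "claude-3-5-sonnet-20240620": (54.3, 1.6),
    "mistralai-Mixtral-8x22B": (40.0, 1.5),
    "google-gemma-2b": (10.3, 0.96),
    "Qwen-Qwen2-0.5B": (3.9, 0.61),
}

for model, (pass1, se_listed) in rows.items():
    se = binomial_se_percent(pass1)
    print(f"{model}: computed {se:.2f}, listed {se_listed}")
```

Under this assumption the computed values agree with the listed ones to the displayed precision. The SE_x(A) and SE_pred(A) columns are NaN for every row (each model has count 1), so the sketch does not attempt to reproduce them.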