humaneval+: by models


SE predicted by accuracy

Typical standard error for comparisons between pairs of models on this dataset, as a function of absolute accuracy.

CDF of question level accuracy

Results table by model

model	pass1	win_rate	count	SE(A)	SE_x(A)	SE_pred(A)
claude-3-opus-20240229	77.4	25.3	1	3.3	NaN	NaN
deepseek-coder-33b-instruct	76.2	24.1	1	3.3	NaN	NaN
opencodeinterpreter-ds-33b	74.4	23.8	1	3.4	NaN	NaN
mixtral-8x22b-instruct-v0.1	73.8	23.4	1	3.4	NaN	NaN
speechless-codellama-34b	72.6	22	1	3.5	NaN	NaN
HuggingFaceH4--starchat2-15b-v0.1	72	21.9	1	3.5	NaN	NaN
code-millenials-34b	72	22.3	1	3.5	NaN	NaN
deepseek-coder-6.7b-instruct	72	23.2	1	3.5	NaN	NaN
meta-llama-3-70b-instruct	72	21.9	1	3.5	NaN	NaN
deepseek-coder-7b-instruct-v1.5	71.3	21.6	1	3.5	NaN	NaN
gpt-3.5-turbo	70.7	20.8	1	3.6	NaN	NaN
opencodeinterpreter-ds-6.7b	70.7	21.4	1	3.6	NaN	NaN
xwincoder-34b	70.1	21.2	1	3.6	NaN	NaN
claude-3-haiku-20240307	68.9	20.5	1	3.6	NaN	NaN
openchat	68.9	20.5	1	3.6	NaN	NaN
speechless-coder-ds-6.7b	66.5	17.9	1	3.7	NaN	NaN
code-llama-70b-instruct	66.5	19.6	1	3.7	NaN	NaN
white-rabbit-neo-33b-v1	65.9	19	1	3.7	NaN	NaN
codebooga-34b	65.9	17.7	1	3.7	NaN	NaN
claude-3-sonnet-20240229	64.6	18.8	1	3.7	NaN	NaN
mistral-large-latest	63.4	18.1	1	3.8	NaN	NaN
speechless-starcoder2-15b	63.4	16.8	1	3.8	NaN	NaN
deepseek-coder-1.3b-instruct	61.6	16.2	1	3.8	NaN	NaN
bigcode--starcoder2-15b-instruct-v0.1	61	15.8	1	3.8	NaN	NaN
Qwen--Qwen1.5-72B-Chat	59.8	15.8	1	3.8	NaN	NaN
microsoft--Phi-3-mini-4k-instruct	59.8	16	1	3.8	NaN	NaN
code-13b	53.7	13.1	1	3.9	NaN	NaN
codegemma-7b-it	53	11.7	1	3.9	NaN	NaN
speechless-coding-7b-16k-tora	52.4	12.2	1	3.9	NaN	NaN
speechless-starcoder2-7b	51.8	11.8	1	3.9	NaN	NaN
wizardcoder-15b	50.6	11	1	3.9	NaN	NaN
open-hermes-2.5-code-290k-13b	50.6	10.8	1	3.9	NaN	NaN
code-33b	50	11.8	1	3.9	NaN	NaN
phi-2	45.7	10.6	1	3.9	NaN	NaN
wizardcoder-7b	45.7	9.72	1	3.9	NaN	NaN
code-llama-multi-34b	43.9	8.78	1	3.9	NaN	NaN
deepseek-coder-33b	43.9	10.6	1	3.9	NaN	NaN
mistral-7b-codealpaca	43.3	9.46	1	3.9	NaN	NaN
starcoder2-15b-oci	43.3	8.89	1	3.9	NaN	NaN
speechless-mistral-7b	42.7	7.81	1	3.9	NaN	NaN
codegemma-7b	42.1	11.4	1	3.9	NaN	NaN
mixtral-8x7b-instruct	40.9	9.08	1	3.8	NaN	NaN
solar-10.7b-instruct	37.8	7.08	1	3.8	NaN	NaN
mistralai--Mistral-7B-Instruct-v0.2	36.6	6.99	1	3.8	NaN	NaN
gemma-1.1-7b-it	36	5.83	1	3.7	NaN	NaN
code-llama-multi-13b	34.8	6.12	1	3.7	NaN	NaN
octocoder	33.5	6.44	1	3.7	NaN	NaN
xdan-l1-chat	32.9	5.92	1	3.7	NaN	NaN
python-code-13b	31.7	5.74	1	3.6	NaN	NaN
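
The SE(A) column appears consistent with the plain binomial standard error of pass1, sqrt(p(1 - p) / n), expressed in percentage points. The sketch below checks this against three rows of the table. It rests on two assumptions that the page does not state: that SE(A) is computed this way, and that n = 164, the number of problems in HumanEval. The function name `binomial_se_percent` is hypothetical and used only for illustration.

```python
import math

# Assumption: HumanEval+ scores the same 164 problems as HumanEval.
N_PROBLEMS = 164


def binomial_se_percent(pass1_percent, n=N_PROBLEMS):
    """Binomial standard error of an accuracy, in percentage points.

    Assumes each of the n problems is an independent pass/fail trial
    with success probability p = pass1_percent / 100.
    """
    p = pass1_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# (model, pass1, SE(A)) copied from the table above.
rows = [
    ("claude-3-opus-20240229", 77.4, 3.3),
    ("wizardcoder-15b", 50.6, 3.9),
    ("python-code-13b", 31.7, 3.6),
]

for model, pass1, se_table in rows:
    se = binomial_se_percent(pass1)
    print(f"{model}: computed {se:.1f}, table {se_table}")
```

Under these assumptions the computed values round to the tabulated ones for all three rows. The error is largest near 50% accuracy and shrinks toward either extreme, which matches the trend of SE(A) down the table.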