human_eval_plus: by model

Figure: SE predicted by accuracy. The typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
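
For orientation, under a simple binomial model the standard error of an accuracy A estimated over n questions is sqrt(A(1-A)/n); at 77.6% accuracy with the 164 HumanEval+ problems this gives about 3.3 percentage points, matching the SE(A) column below. The sketch below assumes that binomial model with n = 164 and treats the two models' errors as independent; since both models answer the same questions, real paired SEs can be smaller.

```python
import numpy as np

N_QUESTIONS = 164  # HumanEval+ problem count (assumption: SEs are per-question binomial)

def se_accuracy(acc_pct: float, n: int = N_QUESTIONS) -> float:
    """Binomial standard error of an accuracy estimate, in percentage points."""
    p = acc_pct / 100.0
    return 100.0 * np.sqrt(p * (1.0 - p) / n)

def se_pair(acc_a_pct: float, acc_b_pct: float, n: int = N_QUESTIONS) -> float:
    """SE of the accuracy difference between two models, assuming independence.
    Shared questions induce positive correlation, which would shrink this."""
    return np.sqrt(se_accuracy(acc_a_pct, n) ** 2 + se_accuracy(acc_b_pct, n) ** 2)

if __name__ == "__main__":
    # qwen3-14b at 77.6% pass@1: binomial SE ~ 3.3 pp, matching SE(A) in the table
    print(f"{se_accuracy(77.6):.2f} pp")    # ~3.26
    print(f"{se_pair(77.6, 77.3):.2f} pp")  # ~4.6 pp under independence
```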

Figure: CDF of question-level accuracy.
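
A question-level accuracy CDF can be rebuilt from per-question pass rates: for each problem, the fraction of sampled completions that pass the tests, sorted into an empirical CDF. Below is a sketch on synthetic 0/1 outcomes; the 164 x 7 shape is an assumption suggested by count ≈ 1.1e+03, since the page does not state how many completions were drawn per problem.

```python
import numpy as np
import matplotlib.pyplot as plt

def question_level_cdf(passes: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Empirical CDF of per-question pass rates.

    passes: array of shape (n_questions, n_samples), entries 0/1 for
    whether each sampled completion passed the tests.
    """
    rates = passes.mean(axis=1)  # per-question accuracy
    xs = np.sort(rates)
    ys = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ys

# Synthetic demo: 164 problems x 7 samples, per-question difficulty drawn from a Beta
rng = np.random.default_rng(0)
difficulty = rng.beta(2, 1, size=(164, 1))
toy = rng.binomial(1, difficulty, size=(164, 7))
xs, ys = question_level_cdf(toy)
plt.step(xs, ys, where="post")
plt.xlabel("question-level accuracy")
plt.ylabel("CDF")
plt.show()
```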

Results table by model

model  pass@1(%)  win_rate(%)  count  SE(A)  SE_x(A)  SE_pred(A)
(pass@1 and win_rate are percentages; the SE columns are in percentage points)
qwen3-14b 77.6 38 1.1e+03 3.3 2.9 1.4
qwen3-32b 77.3 37.7 1.1e+03 3.3 2.7 1.8
google_gemma_3_27b_it 75.7 36.2 7 3.3 3.2 0.93
qwen2.5-coder-32b-instruct 75 36.8 1.1e+03 3.4 2.7 2.1
qwen2.5-coder-14b-instruct 74.8 36.6 1.1e+03 3.4 2.6 2.1
google_gemma_3_12b_it 72.9 34 1.1e+03 3.5 3.3 1.1
llama-3.1-70B-instruct 70.7 33 1.1e+03 3.6 2.9 2
qwen3-8b 70.7 34.1 1.1e+03 3.6 2.9 2.1
qwen3-4b 69.7 32.3 1.1e+03 3.6 3.2 1.7
google_gemma_2_27b_it 66.4 29.8 1.1e+03 3.7 3.4 1.5
mistralai_mixtral_8x22b_instruct_v0.1 65.9 29.8 8.9e+02 3.7 2.9 2.3
deepseek_r1_distill_qwen_32b 62.7 30.5 1.1e+03 3.8 2.3 3
qwen2-math-72b-instruct 61.7 27.8 1.4e+02 3.8 2.8 2.5
google_gemma_3_4b_it 61 27.1 1.1e+03 3.8 3.6 1.4
google_gemma_2_9b_it 55.2 23.4 1.1e+03 3.9 3.6 1.5
llama-3.1-8B-instruct 54.8 22.9 1.1e+03 3.9 3.1 2.4
deepseek_r1_distill_qwen_14b 51.9 24.7 1.1e+03 3.9 2.3 3.1
qwen2-7b-instruct 50.1 21.6 1.1e+03 3.9 2.5 3
deepseek_r1_distill_llama_70b 49.9 23.6 1.1e+03 3.9 2.3 3.2
qwen3-1.7b 49.8 21.2 1.1e+03 3.9 3.3 2.1
qwen2-72b-instruct 48.4 22.1 1.1e+03 3.9 2.4 3.1
google_codegemma_1.1_7b_it 47.4 18.7 1.1e+03 3.9 3.2 2.3
qwen2.5-coder-7b-instruct 47.4 20.8 1.1e+03 3.9 2.3 3.2
qwen1.5-14b-chat 44.7 17.7 1.1e+03 3.9 3 2.4
qwen2.5-coder-3b-instruct 44.3 19.2 1.1e+03 3.9 2.3 3.1
llama-3.2-3B-instruct 44.2 17.4 1.1e+03 3.9 2.9 2.5
deepseek_v2_lite_chat 42.2 16.8 1.1e+03 3.9 2.8 2.7
qwen1.5-32b-chat 41.6 16.6 1.1e+03 3.8 3.2 2.2
mistralai_ministral_8b_instruct_2410 40.1 16.5 1.1e+03 3.8 2.4 3
deepseek_r1_distill_llama_8b 37.9 16.1 1.1e+03 3.8 2.3 3
qwen1.5-72b-chat 36.9 14.4 1.1e+03 3.8 3 2.3
mistralai_mathstral_7b_v0.1 36.3 14 1.1e+03 3.8 2.4 2.9
google_gemma_3_1b_it 36.1 13.8 1.1e+03 3.8 3.5 1.3
qwen2.5-coder-1.5b-instruct 35.3 13.8 1.1e+03 3.7 2.3 3
qwen1.5-7b-chat 33.2 12.2 1.1e+03 3.7 2.8 2.4
mistralai_mistral_7b_instruct_v0.3 31.7 11 1.1e+03 3.6 2.8 2.3
qwen2.5-coder-0.5b-instruct 30.7 11.8 1.1e+03 3.6 2.2 2.8
deepseek_r1_distill_qwen_7b 28.3 11.7 1.1e+03 3.5 2 2.9
llama-3.2-1B-instruct 25.9 8.72 1.1e+03 3.4 2.5 2.3
mistralai_mistral_7b_instruct_v0.1 22.4 7.24 1.1e+03 3.3 2.3 2.3
qwen3-0.6b 21.2 6.99 1.1e+03 3.2 2.3 2.2
google_gemma_7b_it 20.4 6.5 1.1e+03 3.1 2.6 1.8
qwen2-1.5b-instruct 14 5.2 1.1e+03 2.7 1.2 2.4
google_gemma_2b_it 13.7 3.6 1.1e+03 2.7 2.3 1.4
mistralai_mistral_7b_instruct_v0.2 7.76 2.45 1.1e+03 2.1 1.2 1.7
qwen2-0.5b-instruct 7.41 2.12 1.1e+03 2 1.1 1.7
qwen1.5-1.8b-chat 5.42 1.52 1.1e+03 1.8 0.93 1.5
deepseek_r1_distill_qwen_1.5b 2.32 0.722 1.1e+03 1.2 0.38 1.1
qwen1.5-0.5b-chat 1.79 0.302 1e+03 1 0.64 0.81
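
The three SE columns are consistent with the decomposition SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2, i.e., the total standard error splitting into a cross-question component and a within-question sampling component (that reading of SE_x and SE_pred is an inference, not stated on this page). A quick numerical check against a few printed rows:

```python
import math

# (model, SE(A), SE_x(A), SE_pred(A)) copied from the table above
rows = [
    ("qwen3-14b",                    3.3, 2.9,  1.4),
    ("google_gemma_3_27b_it",        3.3, 3.2,  0.93),
    ("deepseek_r1_distill_qwen_32b", 3.8, 2.3,  3.0),
    ("qwen1.5-0.5b-chat",            1.0, 0.64, 0.81),
]

for model, se, se_x, se_pred in rows:
    recombined = math.hypot(se_x, se_pred)  # sqrt(SE_x^2 + SE_pred^2)
    print(f"{model:32s} SE(A)={se:.2f}  sqrt(SE_x^2+SE_pred^2)={recombined:.2f}")
```

Each recombined value agrees with the printed SE(A) to the table's rounding (e.g., sqrt(2.9^2 + 1.4^2) ≈ 3.22 against 3.3 for qwen3-14b).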