human_eval_plus: results by model



[Figure] SE predicted by accuracy — the typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
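The curve in this figure is consistent with the binomial standard error of a mean accuracy over the 164 HumanEval+ problems: plugging the pass@1 values from the table below into sqrt(A(1-A)/N) with N = 164 reproduces the SE(A) column. A minimal sketch (the formula and N are inferred from the data, not stated on this page):

```python
import math

N_QUESTIONS = 164  # number of HumanEval+ problems (assumption)

def se_accuracy(acc: float, n: int = N_QUESTIONS) -> float:
    """Binomial standard error of a mean accuracy over n i.i.d. questions."""
    return math.sqrt(acc * (1.0 - acc) / n)

# The SE peaks at 50% accuracy and shrinks toward 0% and 100%:
for acc in (0.772, 0.5, 0.05):
    print(f"acc={acc:.1%}  SE={se_accuracy(acc):.1%}")
```

For example, qwen3-14b at 77.2% pass@1 gives an SE of about 3.3 percentage points, matching its SE(A) entry.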

[Figure] CDF of question-level accuracy.
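Given per-question pass rates, an empirical CDF like the one shown here can be built by sorting the values and assigning each its cumulative rank; a sketch (`question_acc` is a hypothetical list of per-question accuracies, not data from this page):

```python
def empirical_cdf(values):
    """Return sorted (x, F(x)) pairs, where F(x) is the fraction of values <= x."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

# Hypothetical per-question accuracies averaged over sampled runs:
question_acc = [0.0, 0.25, 0.25, 0.5, 1.0]
for x, f in empirical_cdf(question_acc):
    print(x, f)
```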

Results table by model

| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| qwen3-14b | 77.2 | 44.4 | 3 | 3.3 | 2.8 | 1.7 |
| google_gemma_3_27b_it | 75 | 41.8 | 3 | 3.4 | 3.2 | 1.2 |
| qwen2.5-coder-32b-instruct | 74.1 | 42.6 | 2 | 3.4 | 2.5 | 2.3 |
| qwen3-32b | 73.4 | 41.7 | 3 | 3.5 | 2.6 | 2.3 |
| google_gemma_3_12b_it | 72.9 | 40.2 | 4 | 3.5 | 3.3 | 1.2 |
| qwen3-4b | 69.4 | 38.1 | 4 | 3.6 | 3.1 | 1.8 |
| qwen3-8b | 68.9 | 38.7 | 3 | 3.6 | 2.8 | 2.3 |
| qwen2.5-coder-14b-instruct | 68.1 | 38.7 | 3 | 3.6 | 2.3 | 2.9 |
| google_gemma_2_27b_it | 66.8 | 35.9 | 2 | 3.7 | 3.2 | 1.8 |
| google_gemma_3_4b_it | 60.9 | 32.1 | 5 | 3.8 | 3.5 | 1.5 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 60 | 31.4 | 3 | 3.8 | 2.7 | 2.7 |
| google_gemma_2_9b_it | 56.1 | 29 | 4 | 3.9 | 3.5 | 1.7 |
| llama-3.1-8B-instruct | 50.6 | 24.8 | 7 | 3.9 | 3.9 | 0 |
| qwen3-1.7b | 50.5 | 25.6 | 4 | 3.9 | 3.2 | 2.2 |
| deepseek_r1_distill_qwen_32b | 47 | 25.4 | 2 | 3.9 | 1.6 | 3.6 |
| qwen2-math-72b-instruct | 45.7 | 24.4 | 2 | 3.9 | 2.3 | 3.1 |
| google_codegemma_1.1_7b_it | 44.9 | 21.4 | 5 | 3.9 | 3.1 | 2.4 |
| qwen2-72b-instruct | 43.3 | 23.2 | 2 | 3.9 | 1.8 | 3.4 |
| deepseek_r1_distill_qwen_14b | 43.1 | 23.8 | 4 | 3.9 | 1.8 | 3.4 |
| deepseek_r1_distill_llama_70b | 43 | 24.1 | 2 | 3.9 | 2 | 3.3 |
| qwen1.5-14b-chat | 42.3 | 19.9 | 3 | 3.9 | 2.9 | 2.5 |
| qwen1.5-32b-chat | 41.5 | 20.7 | 3 | 3.8 | 2.8 | 2.6 |
| deepseek_v2_lite_chat | 36.2 | 17 | 3 | 3.8 | 2.6 | 2.7 |
| qwen2-7b-instruct | 35.5 | 17.9 | 4 | 3.7 | 1.9 | 3.2 |
| google_gemma_3_1b_it | 35.5 | 16.4 | 4 | 3.7 | 3.5 | 1.3 |
| qwen2.5-coder-7b-instruct | 35.4 | 17.5 | 4 | 3.7 | 2.1 | 3.1 |
| qwen1.5-72b-chat | 34.1 | 16.1 | 2 | 3.7 | 2.5 | 2.7 |
| qwen2.5-coder-3b-instruct | 33.7 | 17.4 | 4 | 3.7 | 1.8 | 3.2 |
| llama-3.2-3B-instruct | 33.5 | 15.1 | 10 | 3.7 | 3.7 | 0 |
| mistralai_ministral_8b_instruct_2410 | 32 | 15.7 | 4 | 3.6 | 1.9 | 3.1 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 30.3 | 14.2 | 3 | 3.6 | 2.7 | 2.4 |
| mistralai_mistral_7b_instruct_v0.3 | 30.2 | 13.2 | 4 | 3.6 | 2.6 | 2.5 |
| qwen1.5-7b-chat | 29.7 | 13.3 | 3 | 3.6 | 2.5 | 2.5 |
| deepseek_r1_distill_llama_8b | 24.4 | 12.2 | 4 | 3.4 | 1.6 | 2.9 |
| mistralai_mathstral_7b_v0.1 | 23.8 | 10.4 | 4 | 3.3 | 1.8 | 2.8 |
| qwen2-math-7b-instruct | 23.5 | 10.9 | 2 | 3.3 | 2.1 | 2.6 |
| qwen2.5-coder-1.5b-instruct | 22.4 | 10.3 | 4 | 3.3 | 1.6 | 2.8 |
| deepseek_r1_distill_qwen_7b | 20.7 | 10.2 | 4 | 3.2 | 1.5 | 2.8 |
| google_gemma_7b_it | 20.4 | 8.83 | 4 | 3.1 | 2.5 | 1.9 |
| llama-3.2-1B-instruct | 20.1 | 8.17 | 13 | 3.1 | 3.1 | 0 |
| qwen2.5-coder-0.5b-instruct | 18.4 | 8.18 | 5 | 3 | 1.4 | 2.7 |
| qwen3-0.6b | 17.1 | 7.15 | 5 | 2.9 | 1.9 | 2.3 |
| mistralai_mistral_7b_instruct_v0.1 | 13.6 | 5.29 | 4 | 2.7 | 1.5 | 2.2 |
| google_gemma_2b_it | 13.3 | 4.63 | 4 | 2.6 | 2.2 | 1.5 |
| mistralai_mistral_7b_instruct_v0.2 | 7.93 | 2.95 | 4 | 2.1 | 1.1 | 1.8 |
| qwen2-1.5b-instruct | 5.79 | 2.42 | 4 | 1.8 | 0.54 | 1.7 |
| qwen1.5-1.8b-chat | 3.66 | 1.27 | 3 | 1.5 | 0.73 | 1.3 |
| qwen2-0.5b-instruct | 2.68 | 0.851 | 5 | 1.3 | 0.57 | 1.1 |
| qwen2-math-1.5b-instruct | 1.63 | 0.621 | 3 | 0.99 | 0.69 | 0.7 |
| qwen1.5-0.5b-chat | 1.22 | 0.356 | 5 | 0.86 | 0.42 | 0.75 |
| deepseek_r1_distill_qwen_1.5b | 0.457 | 0.162 | 4 | 0.53 | 0 | 0.53 |
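The three error columns appear to satisfy SE(A)² ≈ SE_x(A)² + SE_pred(A)², i.e. the total standard error combines a question-sampling component and a prediction component in quadrature. A quick check on a few rows (this is an observation from the table values, not a formula documented on this page):

```python
import math

# (SE(A), SE_x(A), SE_pred(A)) triples copied from the table above
rows = {
    "qwen3-14b": (3.3, 2.8, 1.7),
    "deepseek_r1_distill_qwen_32b": (3.9, 1.6, 3.6),
    "llama-3.2-3B-instruct": (3.7, 3.7, 0.0),
}
for name, (se, se_x, se_pred) in rows.items():
    combined = math.hypot(se_x, se_pred)  # sqrt(se_x**2 + se_pred**2)
    print(f"{name}: SE(A)={se}  sqrt(SE_x^2 + SE_pred^2)={combined:.1f}")
```

Each row's combined value agrees with SE(A) to the one-decimal precision reported here.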