human_eval: by models



Figure: SE predicted by accuracy. The typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
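HumanEval has 164 problems, and the SE(A) column in the table below is consistent, up to rounding, with the plain binomial standard error at that sample size. Here is a minimal sketch of that relationship; the function name and constant are illustrative, not taken from this site's code.

```python
# Minimal sketch: binomial standard error of an accuracy estimate as a
# function of the accuracy itself, assuming n = 164 independent problems
# (the size of HumanEval). The SE(A) column below matches this formula
# up to rounding.
import math

N_PROBLEMS = 164  # HumanEval has 164 problems

def binomial_se(accuracy_pct: float, n: int = N_PROBLEMS) -> float:
    """Standard error of a pass@1 estimate, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# Examples: qwen3-32b at 87.5% pass@1 and llama-3.2-3B at 45.7%.
print(round(binomial_se(87.5), 1))  # 2.6, matching SE(A) in the table
print(round(binomial_se(45.7), 1))  # 3.9, matching SE(A) in the table
```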

Figure: CDF of question-level accuracy.
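For readers who want to reproduce a curve like this, here is a sketch of how a question-level accuracy CDF can be computed. The `results` map and its placeholder data are hypothetical; the site computes the real curve from its own per-problem evaluation records.

```python
# Sketch: empirical CDF of question-level accuracy. `results` is a
# hypothetical {model: pass/fail array} map with one entry per HumanEval
# problem; replace the placeholder data with real evaluation records.
import numpy as np

results = {
    "model_a": np.random.rand(164) < 0.8,   # placeholder pass/fail data
    "model_b": np.random.rand(164) < 0.45,  # placeholder pass/fail data
}

# Per-question accuracy, averaged over models.
passes = np.stack(list(results.values()))  # shape (n_models, 164)
question_acc = passes.mean(axis=0)

# Empirical CDF: fraction of questions at or below each accuracy level.
xs = np.sort(question_acc)
cdf = np.arange(1, len(xs) + 1) / len(xs)
for x, y in zip(xs[::40], cdf[::40]):
    print(f"P(question accuracy <= {x:.2f}) = {y:.2f}")
```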

Results table by model

All columns except count are in percentage points.

| model | pass@1 (%) | win_rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---:|---:|---:|---:|---:|---:|
| qwen3-32b | 87.5 | 42.9 | 11 | 2.6 | 1.9 | 1.8 |
| google_gemma_3_27b_it | 86.5 | 41.4 | 12 | 2.7 | 2.5 | 0.79 |
| qwen3-14b | 86 | 41.7 | 12 | 2.7 | 2.3 | 1.5 |
| qwen2.5-coder-32b-instruct | 83.3 | 41 | 11 | 2.9 | 1.9 | 2.2 |
| google_gemma_3_12b_it | 82.8 | 38.7 | 11 | 2.9 | 2.7 | 1.1 |
| qwen2.5-coder-14b-instruct | 82.3 | 39.9 | 12 | 3 | 1.9 | 2.3 |
| qwen3-4b | 78.9 | 36.7 | 12 | 3.2 | 2.7 | 1.7 |
| qwen3-8b | 77.6 | 36.6 | 12 | 3.3 | 2.5 | 2.1 |
| google_gemma_2_27b_it | 75.2 | 33.7 | 10 | 3.4 | 3.1 | 1.4 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 74.2 | 33.4 | 11 | 3.4 | 2.6 | 2.2 |
| qwen2-math-72b-instruct | 70.6 | 31.8 | 11 | 3.6 | 2.4 | 2.6 |
| deepseek_r1_distill_qwen_32b | 70.2 | 33.8 | 11 | 3.6 | 1.8 | 3.1 |
| google_gemma_3_4b_it | 69.6 | 30.5 | 13 | 3.6 | 3.3 | 1.4 |
| llama-3.1-8B-instruct | 65.9 | 28 | 15 | 3.7 | 3.7 | 0 |
| google_gemma_2_9b_it | 63.1 | 26.6 | 11 | 3.8 | 3.5 | 1.4 |
| deepseek_r1_distill_qwen_14b | 58.6 | 27.7 | 11 | 3.8 | 2 | 3.3 |
| qwen2-7b-instruct | 57.7 | 24.7 | 11 | 3.9 | 2.3 | 3.1 |
| qwen3-1.7b | 57.1 | 23.9 | 12 | 3.9 | 3.2 | 2.1 |
| google_codegemma_1.1_7b_it | 56.4 | 22.3 | 13 | 3.9 | 3.1 | 2.3 |
| qwen2-72b-instruct | 55 | 25.6 | 11 | 3.9 | 2 | 3.3 |
| deepseek_r1_distill_llama_70b | 54.4 | 25.1 | 11 | 3.9 | 2 | 3.3 |
| qwen2.5-coder-7b-instruct | 54 | 23.5 | 10 | 3.9 | 2.1 | 3.3 |
| qwen1.5-14b-chat | 52.1 | 20.3 | 12 | 3.9 | 3.1 | 2.4 |
| qwen2.5-coder-3b-instruct | 50 | 21.3 | 12 | 3.9 | 2.2 | 3.2 |
| deepseek_v2_lite_chat | 48.8 | 19.2 | 11 | 3.9 | 2.7 | 2.8 |
| qwen1.5-32b-chat | 48.2 | 19.1 | 11 | 3.9 | 3.2 | 2.3 |
| mistralai_ministral_8b_instruct_2410 | 46.1 | 19.2 | 11 | 3.9 | 2.3 | 3.1 |
| llama-3.2-3B-instruct | 45.7 | 16.9 | 17 | 3.9 | 3.9 | 0 |
| mistralai_mathstral_7b_v0.1 | 44 | 17.1 | 11 | 3.9 | 2.3 | 3.1 |
| qwen2.5-coder-1.5b-instruct | 43.1 | 16.6 | 11 | 3.9 | 2.3 | 3.1 |
| deepseek_r1_distill_llama_8b | 42.6 | 18.1 | 13 | 3.9 | 2.2 | 3.2 |
| qwen1.5-72b-chat | 42.6 | 16.6 | 11 | 3.9 | 3 | 2.5 |
| mistralai_mistral_7b_instruct_v0.3 | 41.6 | 14.8 | 11 | 3.8 | 3.1 | 2.2 |
| google_gemma_3_1b_it | 41.3 | 15.4 | 13 | 3.8 | 3.6 | 1.4 |
| qwen1.5-7b-chat | 39.8 | 14.4 | 12 | 3.8 | 2.9 | 2.4 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 38.7 | 14.9 | 12 | 3.8 | 2.8 | 2.6 |
| qwen2.5-coder-0.5b-instruct | 36.4 | 13.6 | 13 | 3.8 | 2.3 | 3 |
| qwen2-math-7b-instruct | 32.2 | 12.4 | 6 | 3.6 | 2.4 | 2.8 |
| deepseek_r1_distill_qwen_7b | 31.8 | 13 | 11 | 3.6 | 2 | 3 |
| llama-3.2-1B-instruct | 30.5 | 10.6 | 12 | 3.6 | 3.6 | 0 |
| mistralai_mistral_7b_instruct_v0.1 | 28.3 | 9.4 | 11 | 3.5 | 2.4 | 2.6 |
| google_gemma_7b_it | 26.3 | 8.58 | 13 | 3.4 | 2.9 | 1.9 |
| qwen3-0.6b | 25.1 | 8.03 | 13 | 3.4 | 2.5 | 2.3 |
| google_gemma_2b_it | 17 | 4.73 | 13 | 2.9 | 2.4 | 1.6 |
| qwen2-1.5b-instruct | 16.4 | 6.09 | 13 | 2.9 | 1.2 | 2.6 |
| qwen2-0.5b-instruct | 10.3 | 2.82 | 13 | 2.4 | 1.4 | 1.9 |
| mistralai_mistral_7b_instruct_v0.2 | 9.15 | 2.69 | 10 | 2.3 | 1.2 | 1.9 |
| qwen1.5-1.8b-chat | 7.43 | 2.09 | 11 | 2 | 1.1 | 1.7 |
| qwen2-math-1.5b-instruct | 3.96 | 1.45 | 4 | 1.5 | 0.98 | 1.2 |
| qwen1.5-0.5b-chat | 2.91 | 0.517 | 13 | 1.3 | 0.75 | 1.1 |
| deepseek_r1_distill_qwen_1.5b | 2.67 | 0.804 | 13 | 1.3 | 0.39 | 1.2 |
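One pattern worth noting (an observation from the numbers, not something stated on this page): the three SE columns are consistent, up to rounding, with the decomposition SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2. A quick check on a few rows:

```python
# Sanity check (an observation, not documented on this page): the three
# SE columns appear to decompose as SE(A)^2 ~ SE_x(A)^2 + SE_pred(A)^2,
# up to the table's rounding. A few rows from above:
import math

rows = [  # (model, SE(A), SE_x(A), SE_pred(A))
    ("qwen3-32b",                    2.6, 1.9,  1.8),
    ("deepseek_r1_distill_qwen_32b", 3.6, 1.8,  3.1),
    ("llama-3.1-8B-instruct",        3.7, 3.7,  0.0),
    ("qwen1.5-0.5b-chat",            1.3, 0.75, 1.1),
]
for model, se, se_x, se_pred in rows:
    implied = math.sqrt(max(se**2 - se_x**2, 0.0))  # guard rounding noise
    print(f"{model}: reported SE_pred={se_pred}, implied {implied:.2f}")
```

Consistent with this reading, the rows where SE_x(A) equals SE(A) (the three Llama models above) report SE_pred(A) = 0.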