human_eval: by model



Figure: SE predicted by accuracy (the typical standard error between pairs of models on this dataset, as a function of absolute accuracy).
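
The SE(A) column in the table below appears to match the plain binomial standard error at n = 164 questions (the size of HumanEval), so the curve in this figure can be sketched directly. A minimal sketch under that assumption:

```python
import math

N_QUESTIONS = 164  # HumanEval has 164 problems

def binomial_se(accuracy_pct: float, n: int = N_QUESTIONS) -> float:
    """Standard error of mean accuracy over n independent questions,
    returned in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

print(round(binomial_se(87.0), 1))   # 2.6, matching SE(A) for the top row
print(round(binomial_se(51.8), 1))   # 3.9
print(round(binomial_se(1.37), 2))   # 0.91, matching the bottom row
```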

Figure: CDF of question-level accuracy.
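
A curve like this can be reproduced from per-question pass results. A minimal sketch, assuming a 0/1 score matrix of shape (questions x samples); the array below is stand-in random data, not the real scores:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=(164, 4))  # stand-in 0/1 pass results

q_acc = scores.mean(axis=1)                 # per-question accuracy
xs = np.sort(q_acc)
cdf = np.arange(1, len(xs) + 1) / len(xs)   # empirical CDF
for x, c in zip(xs[::40], cdf[::40]):
    print(f"P(accuracy <= {x:.2f}) = {c:.2f}")
```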

Results table by model

pass@1 and win_rate are percentages, and the SE columns are in percentage points. SE_x(A) and SE_pred(A) decompose the total standard error, with SE(A)^2 = SE_x(A)^2 + SE_pred(A)^2 up to rounding.

model pass@1 win_rate count SE(A) SE_x(A) SE_pred(A)
google_gemma_3_27b_it 87 47.1 3 2.6 2.4 1
qwen3-14b 85.4 46.3 3 2.8 2.3 1.5
google_gemma_3_12b_it 82.9 44.1 4 2.9 2.7 1.1
qwen3-32b 80.3 43 3 3.1 1.9 2.5
qwen2.5-coder-32b-instruct 79.3 43.2 2 3.2 1.9 2.5
qwen3-4b 78.8 41.6 4 3.2 2.6 1.8
qwen3-8b 78.7 42.5 3 3.2 2.3 2.2
qwen2.5-coder-14b-instruct 74 39.6 3 3.4 2 2.8
google_gemma_2_27b_it 73.2 37.6 2 3.5 2.9 1.8
mistralai_mixtral_8x22b_instruct_v0.1 71.1 36.1 3 3.5 2.5 2.5
google_gemma_3_4b_it 69.8 34.8 5 3.6 3.3 1.5
google_gemma_2_9b_it 63.3 30.9 4 3.8 3.4 1.7
llama-3.1-8B-instruct 59.8 28 7 3.8 3.8 0
qwen3-1.7b 58.8 28.9 4 3.8 3 2.4
qwen2-math-72b-instruct 57.9 28.8 2 3.9 2.4 3
google_codegemma_1.1_7b_it 54.9 25.5 5 3.9 3 2.5
qwen1.5-14b-chat 51.8 23.6 3 3.9 2.8 2.7
deepseek_r1_distill_qwen_32b 51.5 27.2 2 3.9 1.3 3.7
deepseek_r1_distill_qwen_14b 47.3 24.3 4 3.9 1.7 3.5
qwen2-72b-instruct 47 23.8 2 3.9 1.9 3.4
qwen1.5-32b-chat 46.7 21.7 3 3.9 2.8 2.7
deepseek_r1_distill_llama_70b 45.1 24.2 2 3.9 1.6 3.6
llama-3.2-3B-instruct 42.1 18.2 10 3.9 3.9 0
qwen1.5-72b-chat 41.8 18.3 2 3.9 2.9 2.6
google_gemma_3_1b_it 40.1 17.6 4 3.8 3.4 1.8
qwen2-7b-instruct 39.6 18.6 4 3.8 1.9 3.3
deepseek_v2_lite_chat 39.2 17.9 3 3.8 2 3.2
qwen2.5-coder-3b-instruct 38.6 18 4 3.8 2.1 3.2
qwen2.5-coder-7b-instruct 38.6 17.9 4 3.8 1.9 3.3
mistralai_mistral_7b_instruct_v0.3 37.7 15.3 4 3.8 3 2.3
qwen1.5-7b-chat 36.4 15.2 3 3.8 2.7 2.7
mistralai_ministral_8b_instruct_2410 35.7 15.9 4 3.7 1.9 3.2
mistralai_mixtral_8x7b_instruct_v0.1 35 15.6 3 3.7 2.6 2.7
mistralai_mathstral_7b_v0.1 30.6 13.1 4 3.6 2.1 3
qwen2.5-coder-1.5b-instruct 28.8 12 4 3.5 2.1 2.8
deepseek_r1_distill_llama_8b 27.4 13 4 3.5 1.6 3.1
qwen2-math-7b-instruct 25 11.2 2 3.4 1.6 3
google_gemma_7b_it 24.4 9.48 4 3.4 2.7 2
qwen2.5-coder-0.5b-instruct 22.9 9.36 5 3.3 1.9 2.7
deepseek_r1_distill_qwen_7b 22 9.93 4 3.2 1.4 2.9
llama-3.2-1B-instruct 22 8.34 13 3.2 3.2 0
mistralai_mistral_7b_instruct_v0.1 20.1 7.54 4 3.1 1.7 2.6
qwen3-0.6b 18.8 6.99 5 3 1.9 2.4
google_gemma_2b_it 17.7 5.79 4 3 2.4 1.7
mistralai_mistral_7b_instruct_v0.2 11.7 4.32 4 2.5 1.2 2.2
qwen2-1.5b-instruct 8.38 3.32 4 2.2 0.44 2.1
qwen1.5-1.8b-chat 4.47 1.61 3 1.6 0.93 1.3
qwen2-0.5b-instruct 3.78 0.958 5 1.5 0.81 1.2
qwen2-math-1.5b-instruct 2.24 0.891 3 1.2 0.31 1.1
qwen1.5-0.5b-chat 1.83 0.367 5 1 0.62 0.84
deepseek_r1_distill_qwen_1.5b 1.37 0.484 4 0.91 0 0.91
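
A quick spot-check of that decomposition against a few rows of the table (values copied from above):

```python
# SE(A)^2 should equal SE_x(A)^2 + SE_pred(A)^2 up to rounding.
rows = [
    # (model, SE(A), SE_x(A), SE_pred(A))
    ("google_gemma_3_27b_it",         2.6,  2.4, 1.0),
    ("llama-3.1-8B-instruct",         3.8,  3.8, 0.0),
    ("deepseek_r1_distill_qwen_1.5b", 0.91, 0.0, 0.91),
]
for model, se, se_x, se_pred in rows:
    recombined = (se_x**2 + se_pred**2) ** 0.5
    print(f"{model}: SE(A)={se}, sqrt(SE_x^2 + SE_pred^2)={recombined:.2f}")
```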