The table below shows the typical standard errors between pairs of models on this dataset as a function of absolute accuracy, alongside each model's pass@1 accuracy, win rate, and sample count.
| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| google_gemma_3_27b_it | 86.5 | 43.3 | 8 | 1.5 | 1.2 | 0.9 |
| deepseek_r1_distill_qwen_32b | 84.8 | 42.1 | 1100 | 1.6 | 1.2 | 1.1 |
| qwen3-32b | 82.3 | 39.9 | 700 | 1.7 | 1.3 | 1.1 |
| qwen3-14b | 82.3 | 39.8 | 1100 | 1.7 | 1.4 | 1 |
| deepseek_r1_distill_llama_70b | 82.1 | 40.1 | 1100 | 1.7 | 1.3 | 1.2 |
| qwen2-math-72b-instruct | 81.1 | 38.7 | 35 | 1.7 | 1.5 | 0.98 |
| google_gemma_3_12b_it | 80.2 | 38.3 | 1100 | 1.8 | 1.5 | 1 |
| qwen3-8b | 79.7 | 37.8 | 1100 | 1.8 | 1.4 | 1.1 |
| qwen3-4b | 78.2 | 36.8 | 1100 | 1.8 | 1.5 | 1.1 |
| deepseek_r1_distill_qwen_7b | 78.1 | 37.2 | 1100 | 1.8 | 1.4 | 1.2 |
| qwen2.5-coder-32b-instruct | 77 | 35.7 | 1100 | 1.9 | 1.5 | 1.1 |
| qwen2.5-coder-14b-instruct | 72.6 | 32.7 | 1100 | 2 | 1.6 | 1.2 |
| deepseek_r1_distill_llama_8b | 70.5 | 32.8 | 71 | 2 | 1.5 | 1.4 |
| deepseek_r1_distill_qwen_1.5b | 68.7 | 30.7 | 1100 | 2.1 | 1.5 | 1.4 |
| qwen3-1.7b | 67.9 | 29.9 | 1100 | 2.1 | 1.6 | 1.3 |
| llama-3.1-70B-instruct | 66.4 | 29.1 | 1000 | 2.1 | 1.7 | 1.3 |
| google_gemma_3_4b_it | 65.7 | 29.4 | 1100 | 2.1 | 1.7 | 1.3 |
| qwen2-72b-instruct | 65.3 | 27.9 | 220 | 2.1 | 1.7 | 1.3 |
| qwen2.5-coder-7b-instruct | 62.4 | 26.5 | 1100 | 2.2 | 1.6 | 1.5 |
| google_gemma_2_27b_it | 53.1 | 20.8 | 1100 | 2.2 | 1.9 | 1.2 |
| qwen2-7b-instruct | 52.5 | 20.5 | 1100 | 2.2 | 1.7 | 1.5 |
| mistralai_ministral_8b_instruct_2410 | 49.3 | 19 | 1100 | 2.2 | 1.6 | 1.5 |
| mistralai_mathstral_7b_v0.1 | 48.5 | 18.6 | 1100 | 2.2 | 1.6 | 1.5 |
| llama-3.1-8B-instruct | 48.4 | 18.7 | 1100 | 2.2 | 1.7 | 1.5 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 47.8 | 18.4 | 100 | 2.2 | 1.6 | 1.5 |
| qwen2.5-coder-3b-instruct | 47.4 | 18.7 | 1100 | 2.2 | 1.5 | 1.7 |
| google_gemma_2_9b_it | 47.4 | 17.6 | 1100 | 2.2 | 1.9 | 1.2 |
| llama-3.2-3B-instruct | 44.5 | 16.7 | 1100 | 2.2 | 1.7 | 1.4 |
| qwen1.5-72b-chat | 40.6 | 14.9 | 390 | 2.2 | 1.6 | 1.5 |
| qwen1.5-32b-chat | 39.9 | 14.6 | 30 | 2.2 | 1.6 | 1.5 |
| qwen3-0.6b | 33.4 | 12.1 | 1100 | 2.1 | 1.4 | 1.5 |
| qwen2.5-coder-1.5b-instruct | 33.1 | 11.7 | 1100 | 2.1 | 1.4 | 1.6 |
| qwen1.5-14b-chat | 30.9 | 10.4 | 1100 | 2.1 | 1.5 | 1.5 |
| llama-3.2-1B-instruct | 26.6 | 8.92 | 1100 | 2 | 1.4 | 1.4 |
| google_codegemma_1.1_7b_it | 21.9 | 7.04 | 1100 | 1.9 | 1.3 | 1.3 |
| deepseek_v2_lite_chat | 21.4 | 6.98 | 1100 | 1.8 | 1.1 | 1.4 |
| qwen1.5-7b-chat | 17 | 5.27 | 1100 | 1.7 | 1.1 | 1.3 |
| google_gemma_3_1b_it | 14.5 | 6.01 | 1100 | 1.6 | 1 | 1.2 |
| mistralai_mistral_7b_instruct_v0.3 | 13.9 | 4.34 | 1100 | 1.5 | 0.97 | 1.2 |
| mistralai_mistral_7b_instruct_v0.2 | 10.4 | 3.22 | 920 | 1.4 | 0.79 | 1.1 |
| qwen2.5-coder-0.5b-instruct | 8.66 | 2.77 | 1100 | 1.3 | 0.68 | 1.1 |
| qwen2-1.5b-instruct | 7.18 | 2.17 | 1100 | 1.2 | 0.51 | 1 |
| mistralai_mistral_7b_instruct_v0.1 | 6.04 | 1.88 | 1100 | 1.1 | 0.48 | 0.95 |
| google_gemma_7b_it | 5.78 | 1.72 | 1100 | 1 | 0.68 | 0.79 |
| qwen2-0.5b-instruct | 3.34 | 1.13 | 1100 | 0.8 | 0.28 | 0.75 |
| qwen1.5-1.8b-chat | 1.53 | 0.539 | 1100 | 0.55 | 0.14 | 0.53 |
| qwen1.5-0.5b-chat | 0.92 | 0.386 | 890 | 0.43 | 0.089 | 0.42 |
| google_gemma_2b_it | 0.196 | 0.0683 | 1100 | 0.2 | 0.041 | 0.19 |
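
For intuition on how these numbers scale with absolute accuracy, here is a minimal sketch in Python. It assumes a plain binomial approximation, SE = 100·sqrt(p(1−p)/n) percentage points, for a single model's accuracy, plus a simple paired estimator for the accuracy difference between two models scored on the same questions; it is not the code behind the table, and the exact decomposition into SE(A), SE_x(A), and SE_pred(A) used there may differ. The function names (`binomial_se`, `paired_se`) and the toy numbers are illustrative assumptions.

```python
import numpy as np

def binomial_se(acc_pct, n):
    """Binomial standard error, in percentage points, of an accuracy
    estimate `acc_pct` (in percent) measured on `n` independent questions."""
    p = acc_pct / 100.0
    return 100.0 * np.sqrt(p * (1.0 - p) / n)

def paired_se(scores_a, scores_b):
    """Standard error, in percentage points, of the accuracy difference
    between two models scored 0/1 on the same set of questions."""
    d = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    return 100.0 * d.std(ddof=1) / np.sqrt(len(d))

# At ~80% accuracy on ~1100 questions, the single-model binomial SE is about
# 1.2 points; an unpaired comparison of two such models is roughly sqrt(2)
# times larger (~1.7 points), the same ballpark as the SE(A) column above.
print(binomial_se(80.0, 1100))               # ~1.21
print(np.sqrt(2) * binomial_se(80.0, 1100))  # ~1.71

# Toy paired comparison: two models whose per-question outcomes are highly
# correlated, so the paired SE comes out well below the unpaired estimate.
rng = np.random.default_rng(0)
shared = rng.random(1100) < 0.75          # questions both models tend to solve
a = shared | (rng.random(1100) < 0.10)    # model A solves a few extras
b = shared | (rng.random(1100) < 0.05)    # model B solves fewer extras
print(paired_se(a, b))                    # well under the ~1.7 unpaired figure
```

The point of the paired estimator is that, on a fixed question set, much of the question-to-question variability is shared between models, so a paired comparison can be substantially tighter than the unpaired sqrt(2) heuristic suggests.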