aime2025_cot: results by model

Figure: SE predicted by accuracy — the typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
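For intuition, the simplest accuracy-based predictor of the standard error is the binomial SE of mean accuracy over the 30 AIME 2025 problems; the SE(A) column in the table below is numerically consistent with this formula. A minimal sketch in Python (values in percentage points; N = 30 is the AIME 2025 question count):

```python
import math

N_QUESTIONS = 30  # AIME 2025: 15 problems per exam, two exams


def binomial_se(accuracy_pct: float, n_questions: int = N_QUESTIONS) -> float:
    """Binomial standard error of mean accuracy, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


# Spot check against the first table row (deepseek_r1_distill_qwen_32b):
# binomial_se(26.6) -> 8.07, matching the reported SE(A) of 8.1.
print(round(binomial_se(26.6), 2))
```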

Figure: CDF of question-level accuracy.
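The question-level CDF is the empirical distribution of per-question pass rates. A minimal sketch of how such a curve is computed, with made-up per-question counts (K = 36 completions per question is a guess consistent with count ≈ 1.1e+03 over 30 questions; the real counts come from the eval logs behind this page):

```python
import numpy as np

# Hypothetical inputs: pass counts over K sampled completions for each
# of the 30 AIME 2025 problems (placeholders, not the real data).
rng = np.random.default_rng(0)
K = 36
passes = rng.binomial(K, rng.beta(0.5, 2.0, size=30))
question_acc = passes / K  # question-level accuracy in [0, 1]

# Empirical CDF: fraction of questions with accuracy <= t.
xs = np.sort(question_acc)
cdf = np.arange(1, xs.size + 1) / xs.size
for t, f in zip(xs[::10], cdf[::10]):
    print(f"P(question accuracy <= {t:.2f}) = {f:.2f}")
```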

Results table by model

| model | pass@1 (%) | win rate (%) | count (samples) | SE(A) (%) | SE_x(A) (%) | SE_pred(A) (%) |
|---|---|---|---|---|---|---|
| deepseek_r1_distill_qwen_32b | 26.6 | 22 | 1.1e+03 | 8.1 | 6.8 | 4.3 |
| google_gemma_3_27b_it | 25.1 | 20.5 | 9.5e+02 | 7.9 | 6.7 | 4.2 |
| deepseek_r1_distill_llama_70b | 25 | 20.6 | 1.1e+03 | 7.9 | 6.7 | 4.2 |
| deepseek_r1_distill_qwen_14b | 24 | 19.5 | 1.1e+03 | 7.8 | 6.7 | 4 |
| deepseek_r1_distill_qwen_7b | 22.5 | 18.1 | 1.1e+03 | 7.6 | 6.6 | 3.8 |
| qwen3-14b | 20 | 16.1 | 1.1e+03 | 7.3 | 5.8 | 4.5 |
| qwen3-8b | 18.6 | 14.8 | 1.1e+03 | 7.1 | 5.5 | 4.5 |
| qwen3-32b | 18.5 | 15 | 1.1e+03 | 7.1 | 5.3 | 4.7 |
| deepseek_r1_distill_llama_8b | 18.3 | 14.5 | 1.1e+03 | 7.1 | 5.8 | 4 |
| google_gemma_3_12b_it | 17.3 | 13.5 | 1.1e+03 | 6.9 | 6 | 3.3 |
| qwen3-4b | 17.1 | 13.6 | 1.1e+03 | 6.9 | 5.3 | 4.4 |
| deepseek_r1_distill_qwen_1.5b | 15 | 11.7 | 1.1e+03 | 6.5 | 5 | 4.2 |
| qwen2-math-72b-instruct | 11.5 | 9.23 | 1.4e+02 | 5.8 | 3.9 | 4.4 |
| qwen2.5-coder-32b-instruct | 11.4 | 8.99 | 1.1e+03 | 5.8 | 4.1 | 4.1 |
| google_gemma_3_4b_it | 10.8 | 8.25 | 1.1e+03 | 5.7 | 4.5 | 3.5 |
| qwen3-1.7b | 8.55 | 6.75 | 1.1e+03 | 5.1 | 3.3 | 3.9 |
| qwen2.5-coder-14b-instruct | 7.95 | 6.12 | 1.1e+03 | 4.9 | 2.9 | 4 |
| qwen2.5-coder-7b-instruct | 3.14 | 2.42 | 1.1e+03 | 3.2 | 1.1 | 3 |
| qwen2-72b-instruct | 2.62 | 2.13 | 1.1e+03 | 2.9 | 0.89 | 2.8 |
| llama-3.1-70B-instruct | 2.5 | 1.99 | 1.1e+03 | 2.8 | 1.1 | 2.6 |
| qwen2.5-coder-3b-instruct | 1.39 | 1.14 | 1.1e+03 | 2.1 | 0.53 | 2.1 |
| mistralai_ministral_8b_instruct_2410 | 1.17 | 0.955 | 1.1e+03 | 2 | 0.45 | 1.9 |
| qwen3-0.6b | 1.08 | 0.87 | 1.1e+03 | 1.9 | 0.45 | 1.8 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 1.02 | 0.861 | 1.1e+03 | 1.8 | 0.37 | 1.8 |
| google_gemma_2_27b_it | 0.836 | 0.682 | 1.1e+03 | 1.7 | 0.41 | 1.6 |
| llama-3.1-8B-instruct | 0.806 | 0.697 | 1.1e+03 | 1.6 | 0.24 | 1.6 |
| llama-3.2-3B-instruct | 0.761 | 0.642 | 1.1e+03 | 1.6 | 0.35 | 1.5 |
| mistralai_mathstral_7b_v0.1 | 0.676 | 0.585 | 1.1e+03 | 1.5 | 0.19 | 1.5 |
| qwen2-7b-instruct | 0.552 | 0.465 | 1.1e+03 | 1.4 | 0.22 | 1.3 |
| qwen1.5-72b-chat | 0.418 | 0.381 | 1.1e+03 | 1.2 | 0.11 | 1.2 |
| qwen1.5-32b-chat | 0.409 | 0.362 | 1.1e+03 | 1.2 | 0.12 | 1.2 |
| google_gemma_2_9b_it | 0.306 | 0.245 | 1.1e+03 | 1 | 0.18 | 0.99 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 0.306 | 0.278 | 1.1e+03 | 1 | 0.092 | 1 |
| google_gemma_3_1b_it | 0.276 | 0.207 | 1.1e+03 | 0.96 | 0.14 | 0.95 |
| qwen2.5-coder-1.5b-instruct | 0.255 | 0.229 | 1.1e+03 | 0.92 | 0.067 | 0.92 |
| qwen1.5-14b-chat | 0.206 | 0.185 | 1.1e+03 | 0.83 | 0.059 | 0.83 |
| deepseek_v2_lite_chat | 0.194 | 0.175 | 1.1e+03 | 0.8 | 0.042 | 0.8 |
| google_codegemma_1.1_7b_it | 0.161 | 0.151 | 1.1e+03 | 0.73 | 0.061 | 0.73 |
| qwen2.5-coder-0.5b-instruct | 0.161 | 0.153 | 1.1e+03 | 0.73 | 0.085 | 0.73 |
| mistralai_mistral_7b_instruct_v0.3 | 0.155 | 0.13 | 1.1e+03 | 0.72 | 0.063 | 0.71 |
| qwen1.5-7b-chat | 0.13 | 0.115 | 1.1e+03 | 0.66 | 0.032 | 0.66 |
| google_gemma_7b_it | 0.124 | 0.104 | 1.1e+03 | 0.64 | 0.042 | 0.64 |
| llama-3.2-1B-instruct | 0.124 | 0.112 | 1.1e+03 | 0.64 | 0.026 | 0.64 |
| mistralai_mistral_7b_instruct_v0.2 | 0.097 | 0.0913 | 1.1e+03 | 0.57 | 0.039 | 0.57 |
| qwen1.5-0.5b-chat | 0.0727 | 0.0665 | 1.1e+03 | 0.49 | 0.032 | 0.49 |
| qwen2-0.5b-instruct | 0.0545 | 0.0526 | 1.1e+03 | 0.43 | 0.017 | 0.43 |
| qwen1.5-1.8b-chat | 0.0455 | 0.0443 | 1.1e+03 | 0.39 | 0.016 | 0.39 |
| qwen2-1.5b-instruct | 0.0333 | 0.0292 | 1.1e+03 | 0.33 | 0.006 | 0.33 |
| mistralai_mistral_7b_instruct_v0.1 | 0.0303 | 0.0293 | 1.1e+03 | 0.32 | 0.0099 | 0.32 |
| google_gemma_2b_it | 0.00909 | 0.00806 | 1.1e+03 | 0.17 | 0 | 0.17 |
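A pattern worth noting in the table: the three SE columns combine in quadrature, SE(A)² ≈ SE_x(A)² + SE_pred(A)², so SE_x(A) and SE_pred(A) read as orthogonal components of the total standard error. A quick numerical check against a few rows (the decomposition is inferred from the table values, not taken from the paper):

```python
import math

# (SE(A), SE_x(A), SE_pred(A)) triples copied from the table above.
rows = {
    "deepseek_r1_distill_qwen_32b": (8.1, 6.8, 4.3),
    "qwen3-14b": (7.3, 5.8, 4.5),
    "qwen2.5-coder-7b-instruct": (3.2, 1.1, 3.0),
    "google_gemma_2b_it": (0.17, 0.0, 0.17),
}

for model, (se, se_x, se_pred) in rows.items():
    # Recombine the two components in quadrature and compare with SE(A).
    recombined = math.hypot(se_x, se_pred)
    print(f"{model}: SE(A)={se}, sqrt(SE_x^2 + SE_pred^2)={recombined:.2f}")
```

Running this reproduces the reported SE(A) to within rounding (8.05 vs 8.1, 7.34 vs 7.3, 3.20 vs 3.2, 0.17 vs 0.17).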