aime2024_cot: by models

Figure: SE predicted by accuracy. The typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.
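Under a simple binomial model, the standard error of a model's mean accuracy can be predicted from the accuracy alone: SE = sqrt(a(1 - a) / n). Below is a minimal sketch of that prediction, plus the SE of the accuracy difference between two independent models; the exact estimator behind this plot is not specified on this page, so treat both formulas as assumptions.

```python
import math

def predicted_se(accuracy: float, n_questions: int = 30) -> float:
    """Binomial standard error of mean accuracy over n i.i.d. questions.

    n_questions=30 assumes the AIME 2024 question set; the estimator
    actually used for the plot is an assumption here.
    """
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)

def paired_se(acc_a: float, acc_b: float, n_questions: int = 30) -> float:
    """SE of the accuracy difference between two models, assuming their
    errors are independent (ignores question-level correlation)."""
    return math.hypot(predicted_se(acc_a, n_questions),
                      predicted_se(acc_b, n_questions))

# Example: 37.5% accuracy over 30 questions gives an SE of about 8.8
# points, matching SE(A) for the top row of the table below.
print(f"{100 * predicted_se(0.375):.1f}")  # 8.8
```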

Figure: CDF of question-level accuracy.
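The question-level CDF can be rebuilt from a vector of per-question pass rates (the fraction of sampled attempts that solve each question). A short sketch, where `pass_rates` is a hypothetical array standing in for values computed from the raw per-attempt results:

```python
import numpy as np

# Hypothetical input: pass_rates[i] is the fraction of attempts that
# solved question i (30 questions on AIME 2024); placeholder data here.
rng = np.random.default_rng(0)
pass_rates = rng.random(30)

# Empirical CDF: fraction of questions with accuracy <= x.
xs = np.sort(pass_rates)
cdf = np.arange(1, xs.size + 1) / xs.size
for x, p in zip(xs, cdf):
    print(f"P(question accuracy <= {x:.2f}) = {p:.2f}")
```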

Results table by model

| model | pass@1 (%) | win rate (%) | count | SE(A) (%) | SE_x(A) (%) | SE_pred(A) (%) |
|---|---:|---:|---:|---:|---:|---:|
| deepseek_r1_distill_llama_70b | 37.5 | 31.5 | 1.1e+03 | 8.8 | 6.7 | 5.7 |
| deepseek_r1_distill_qwen_32b | 36.6 | 30.7 | 1.1e+03 | 8.8 | 6.8 | 5.6 |
| deepseek_r1_distill_qwen_14b | 31.4 | 26 | 1.1e+03 | 8.5 | 6.3 | 5.7 |
| google_gemma_3_27b_it | 27.8 | 22.5 | 1e+03 | 8.2 | 6.1 | 5.4 |
| deepseek_r1_distill_qwen_7b | 27.4 | 22.4 | 1.1e+03 | 8.1 | 6.1 | 5.4 |
| qwen3-32b | 24.1 | 19.1 | 1.1e+03 | 7.8 | 6.1 | 4.8 |
| qwen3-14b | 23.7 | 18.9 | 1.1e+03 | 7.8 | 6.1 | 4.8 |
| qwen3-8b | 21.8 | 17.2 | 1.1e+03 | 7.5 | 5.8 | 4.8 |
| google_gemma_3_12b_it | 20.5 | 15.9 | 1.1e+03 | 7.4 | 5.8 | 4.6 |
| deepseek_r1_distill_llama_8b | 20.4 | 16.4 | 1.1e+03 | 7.4 | 5.1 | 5.3 |
| llama-3.1-70B-instruct | 19.2 | 15.5 | 1.1e+03 | 7.2 | 5.3 | 4.9 |
| qwen2-math-72b-instruct | 16.7 | 12.9 | 1.6e+02 | 6.8 | 5 | 4.6 |
| qwen3-4b | 16.5 | 12.3 | 1.1e+03 | 6.8 | 5.2 | 4.3 |
| deepseek_r1_distill_qwen_1.5b | 14.3 | 11 | 1.1e+03 | 6.4 | 4.4 | 4.7 |
| qwen2.5-coder-32b-instruct | 13.3 | 9.78 | 1.1e+03 | 6.2 | 4.7 | 4 |
| qwen3-1.7b | 9.26 | 6.78 | 1.1e+03 | 5.3 | 3.8 | 3.7 |
| qwen2.5-coder-14b-instruct | 8.04 | 5.88 | 1.1e+03 | 5 | 3 | 3.9 |
| google_gemma_3_4b_it | 7.02 | 5.12 | 1.1e+03 | 4.7 | 2.7 | 3.8 |
| llama-3.2-3B-instruct | 5.76 | 4.45 | 1.1e+03 | 4.3 | 2 | 3.7 |
| qwen2-72b-instruct | 5.09 | 3.56 | 1.1e+03 | 4 | 2.4 | 3.2 |
| qwen2.5-coder-7b-instruct | 5.02 | 3.56 | 1.1e+03 | 4 | 2.1 | 3.4 |
| llama-3.1-8B-instruct | 4.84 | 3.58 | 1.1e+03 | 3.9 | 2 | 3.4 |
| google_gemma_2_27b_it | 4.12 | 2.71 | 1.1e+03 | 3.6 | 2.5 | 2.7 |
| mistralai_mathstral_7b_v0.1 | 2.16 | 1.52 | 1.1e+03 | 2.7 | 1.1 | 2.4 |
| qwen2-7b-instruct | 1.8 | 1.28 | 1.1e+03 | 2.4 | 0.84 | 2.3 |
| mistralai_ministral_8b_instruct_2410 | 1.65 | 1.22 | 1.1e+03 | 2.3 | 0.94 | 2.1 |
| qwen2.5-coder-3b-instruct | 1.62 | 1.22 | 1.1e+03 | 2.3 | 0.68 | 2.2 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 1.49 | 1.03 | 1.1e+03 | 2.2 | 0.71 | 2.1 |
| qwen1.5-32b-chat | 1.41 | 1.02 | 1.1e+03 | 2.2 | 0.64 | 2.1 |
| google_gemma_2_9b_it | 1.39 | 0.927 | 1.1e+03 | 2.1 | 1.1 | 1.9 |
| llama-3.2-1B-instruct | 1.36 | 1.11 | 1.1e+03 | 2.1 | 0.61 | 2 |
| qwen3-0.6b | 1.2 | 0.864 | 1.1e+03 | 2 | 0.85 | 1.8 |
| qwen1.5-72b-chat | 1.05 | 0.823 | 1.1e+03 | 1.9 | 0.42 | 1.8 |
| qwen1.5-14b-chat | 0.861 | 0.734 | 1.1e+03 | 1.7 | 0.38 | 1.6 |
| qwen1.5-7b-chat | 0.785 | 0.572 | 1.1e+03 | 1.6 | 0.57 | 1.5 |
| qwen2.5-coder-1.5b-instruct | 0.733 | 0.582 | 1.1e+03 | 1.6 | 0.33 | 1.5 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 0.4 | 0.309 | 1.1e+03 | 1.2 | 0.13 | 1.1 |
| google_codegemma_1.1_7b_it | 0.382 | 0.321 | 1.1e+03 | 1.1 | 0.13 | 1.1 |
| deepseek_v2_lite_chat | 0.376 | 0.348 | 1.1e+03 | 1.1 | 0.19 | 1.1 |
| google_gemma_3_1b_it | 0.37 | 0.31 | 1.1e+03 | 1.1 | 0.16 | 1.1 |
| mistralai_mistral_7b_instruct_v0.3 | 0.206 | 0.169 | 1.1e+03 | 0.83 | 0.074 | 0.82 |
| mistralai_mistral_7b_instruct_v0.2 | 0.161 | 0.138 | 1.1e+03 | 0.73 | 0.061 | 0.73 |
| qwen2.5-coder-0.5b-instruct | 0.155 | 0.121 | 1.1e+03 | 0.72 | 0.079 | 0.71 |
| mistralai_mistral_7b_instruct_v0.1 | 0.1 | 0.0953 | 1.1e+03 | 0.58 | 0.055 | 0.57 |
| qwen2-1.5b-instruct | 0.0667 | 0.0584 | 1.1e+03 | 0.47 | 0.017 | 0.47 |
| google_gemma_7b_it | 0.0545 | 0.0492 | 1.1e+03 | 0.43 | 0.023 | 0.43 |
| qwen2-0.5b-instruct | 0.0364 | 0.0302 | 1.1e+03 | 0.35 | 0.01 | 0.35 |
| qwen1.5-0.5b-chat | 0.0273 | 0.0227 | 1.1e+03 | 0.3 | 0.01 | 0.3 |
| qwen1.5-1.8b-chat | 0.0182 | 0.0153 | 1.1e+03 | 0.25 | 0.0027 | 0.25 |
| google_gemma_2b_it | 0 | 0 | 1.1e+03 | 0 | 0 | 0 |
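As a sanity check, the SE(A) column is consistent with the binomial formula from the sketch above evaluated at n = 30 questions. A quick verification on a few rows transcribed from the table:

```python
import math

# (model, pass@1 %, reported SE(A) %) transcribed from the table above.
rows = [
    ("deepseek_r1_distill_llama_70b", 37.5, 8.8),
    ("qwen3-8b", 21.8, 7.5),
    ("llama-3.1-8B-instruct", 4.84, 3.9),
]

for name, pass1, se_reported in rows:
    a = pass1 / 100.0
    se_binomial = 100.0 * math.sqrt(a * (1.0 - a) / 30.0)  # 30 AIME questions
    print(f"{name}: reported {se_reported}, binomial {se_binomial:.1f}")
```

The agreement suggests SE(A) is the plain binomial standard error over the 30 questions; the precise definitions of SE_x(A) and SE_pred(A) are not given on this page.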