aime2024_cot: by models


SE predicted by accuracy

Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.

CDF of question-level accuracy

Results table by model

model	pass1	win_rate	count	SE(A)	SE_x(A)	SE_pred(A)
deepseek_r1_distill_llama_70b	38.3	33.3	2	8.9	5.4	7.1
deepseek_r1_distill_qwen_32b	35	30	2	8.7	6.1	6.2
google_gemma_3_27b_it	27.8	23.1	3	8.2	5.5	6.1
deepseek_r1_distill_qwen_7b	25.8	21.8	4	8	6.1	5.2
qwen3-32b	25.6	20.8	3	8	5.8	5.4
qwen3-14b	23.3	18.8	3	7.7	6.7	3.8
deepseek_r1_distill_qwen_14b	21.7	18	4	7.5	5.5	5.1
google_gemma_3_12b_it	19.2	15.1	4	7.2	6.1	3.7
qwen3-8b	17.8	14.2	3	7	5.1	4.7
qwen3-4b	17.5	13.6	4	6.9	5.9	3.7
llama-3.1-70B-instruct	16.7	13.6	4	6.8	6.8	0
deepseek_r1_distill_qwen_1.5b	13.3	10.4	4	6.2	4.9	3.8
deepseek_r1_distill_llama_8b	13.3	10.4	4	6.2	4.5	4.3
qwen2.5-coder-32b-instruct	11.7	9.03	2	5.9	2.6	5.3
google_gemma_3_4b_it	8	5.68	5	5	4	3
qwen2.5-coder-14b-instruct	7.78	6.31	3	4.9	4.5	1.9
llama-3.2-3B-instruct	6.67	4.59	10	4.6	4.6	0
llama-3.1-8B-instruct	6.67	6.31	7	4.6	4.6	0
qwen2.5-coder-7b-instruct	4.17	2.97	4	3.6	2.2	2.9
qwen3-1.7b	4.17	2.87	4	3.6	3.2	1.7
qwen2-math-7b-instruct	3.33	2.26	2	3.3	0	3.3
qwen2-72b-instruct	3.33	2.68	2	3.3	0	3.3
qwen2-math-1.5b-instruct	3.33	2.32	3	3.3	3.3	0
llama-3.2-1B-instruct	3.33	3	13	3.3	3.3	0
mistralai_mixtral_8x7b_instruct_v0.1	2.22	2.14	3	2.7	0	2.7
google_gemma_2_27b_it	1.67	1.14	2	2.3	0	2.4
google_gemma_3_1b_it	1.67	1.53	4	2.3	0	2.4
mistralai_mathstral_7b_v0.1	1.67	1.43	4	2.3	1.3	1.9
qwen2.5-coder-3b-instruct	1.67	1.28	4	2.3	0	2.4
qwen1.5-14b-chat	1.11	0.757	3	1.9	0	1.9
mistralai_mixtral_8x22b_instruct_v0.1	1.11	0.743	3	1.9	0	1.9
google_gemma_2_9b_it	0.833	0.556	4	1.7	0	1.7
qwen2.5-coder-1.5b-instruct	0.833	0.71	4	1.7	0	1.7
qwen2.5-coder-0.5b-instruct	0.667	0.453	5	1.5	0	1.5
google_codegemma_1.1_7b_it	0.667	0.453	5	1.5	0	1.5
google_gemma_7b_it	0	0	4	0	0	0
google_gemma_2b_it	0	0	4	0	0	0
deepseek_v2_lite_chat	0	0	3	0	0	0
qwen1.5-7b-chat	0	0	3	0	0	0
qwen2-1.5b-instruct	0	0	4	0	0	0
qwen2-0.5b-instruct	0	0	5	0	0	0
qwen1.5-0.5b-chat	0	0	5	0	0	0
mistralai_mistral_7b_instruct_v0.2	0	0	4	0	0	0
mistralai_ministral_8b_instruct_2410	0	0	4	0	0	0
mistralai_mistral_7b_instruct_v0.3	0	0	4	0	0	0
mistralai_mistral_7b_instruct_v0.1	0	0	4	0	0	0
qwen1.5-72b-chat	0	0	2	0	0	0
qwen1.5-32b-chat	0	0	3	0	0	0
qwen1.5-1.8b-chat	0	0	3	0	0	0
qwen2-7b-instruct	0	0	4	0	0	0
qwen3-0.6b	0	0	5	0	0	0
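
The three SE columns in the table appear to be related in two simple ways, and a short check makes that easier to see. The sketch below is not taken from the page's own code. It assumes that the dataset has 30 questions (AIME 2024 I and II combined), that SE(A) is the binomial standard error of pass1 over those questions, and that SE_x(A) and SE_pred(A) combine in quadrature to give SE(A). The names `N_QUESTIONS`, `binomial_se`, and `combined_se` are hypothetical and used only for illustration.

```python
import math

# Assumption: AIME 2024 (I and II combined) has 30 questions.
N_QUESTIONS = 30

# (model, pass1 %, SE(A), SE_x(A), SE_pred(A)) copied from the table above.
ROWS = [
    ("deepseek_r1_distill_llama_70b", 38.3, 8.9, 5.4, 7.1),
    ("deepseek_r1_distill_qwen_32b", 35.0, 8.7, 6.1, 6.2),
    ("google_gemma_3_27b_it", 27.8, 8.2, 5.5, 6.1),
    ("llama-3.1-70B-instruct", 16.7, 6.8, 6.8, 0.0),
    ("google_gemma_3_4b_it", 8.0, 5.0, 4.0, 3.0),
]


def binomial_se(pass1_pct, n=N_QUESTIONS):
    """Binomial standard error of an accuracy, in percentage points."""
    p = pass1_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


def combined_se(se_x, se_pred):
    """Combine two independent error components in quadrature."""
    return math.hypot(se_x, se_pred)


for model, pass1, se, se_x, se_pred in ROWS:
    print(
        f"{model:32s} reported={se:4.1f} "
        f"binomial={binomial_se(pass1):4.2f} "
        f"combined={combined_se(se_x, se_pred):4.2f}"
    )
```

For the rows sampled here, both reconstructions land within 0.1 of the reported SE(A). Some low-accuracy rows in the table (for example those with SE(A) of 2.3 and SE_pred(A) of 2.4) differ by about 0.1, so the relationship should be read as approximate rather than as the page's exact definition.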