aime2025_cot: by models

SE predicted by accuracy

Typical standard errors between pairs of models on this dataset, shown as a function of absolute accuracy.

CDF of question-level accuracy

Results table by model

model	pass1	win_rate	count	SE(A)	SE_x(A)	SE_pred(A)
deepseek_r1_distill_llama_70b	26.7	22.2	2	8.1	6.6	4.7
deepseek_r1_distill_qwen_14b	23.3	19.1	4	7.7	7.4	2.4
deepseek_r1_distill_qwen_7b	23.3	19.1	4	7.7	7.4	2.4
deepseek_r1_distill_qwen_32b	21.7	17.8	2	7.5	7.1	2.4
qwen3-8b	21.1	17.1	3	7.5	6.1	4.3
google_gemma_3_27b_it	21.1	17.3	3	7.5	6.7	3.3
qwen3-14b	18.9	15.6	3	7.1	5.7	4.3
qwen2.5-coder-32b-instruct	18.3	15.4	2	7.1	5.8	4.1
google_gemma_3_12b_it	17.5	14	4	6.9	6	3.5
qwen3-32b	16.7	13.7	3	6.8	5.3	4.3
deepseek_r1_distill_llama_8b	15.8	12.4	4	6.7	6.5	1.7
qwen3-4b	14.2	11	4	6.4	5.2	3.7
qwen2-math-7b-instruct	13.3	10.7	2	6.2	4	4.7
deepseek_r1_distill_qwen_1.5b	11.7	9.51	4	5.9	5.4	2.4
google_gemma_3_4b_it	10.7	8.3	5	5.6	4.4	3.5
qwen3-1.7b	7.5	6.07	4	4.8	2.7	4
qwen2-math-72b-instruct	5	3.94	2	4	0	4.1
llama-3.2-3B-instruct	3.33	3.3	10	3.3	3.3	0
google_gemma_2_27b_it	3.33	2.63	2	3.3	0	3.3
qwen2.5-coder-14b-instruct	3.33	2.38	3	3.3	3.3	0
qwen2-math-1.5b-instruct	3.33	2.42	3	3.3	3.3	0
qwen2.5-coder-7b-instruct	1.67	1.31	4	2.3	0	2.4
google_codegemma_1.1_7b_it	1.33	1.33	5	2.1	1	1.8
mistralai_mixtral_8x7b_instruct_v0.1	1.11	1.11	3	1.9	0	1.9
mistralai_mixtral_8x22b_instruct_v0.1	1.11	0.98	3	1.9	0	1.9
mistralai_ministral_8b_instruct_2410	0.833	0.821	4	1.7	0	1.7
google_gemma_2_9b_it	0.833	0.821	4	1.7	0	1.7
qwen3-0.6b	0.667	0.586	5	1.5	0	1.5
llama-3.2-1B-instruct	0	0	13	0	0	0
llama-3.1-8B-instruct	0	0	7	0	0	0
llama-3.1-70B-instruct	0	0	4	0	0	0
google_gemma_7b_it	0	0	4	0	0	0
google_gemma_3_1b_it	0	0	4	0	0	0
google_gemma_2b_it	0	0	4	0	0	0
deepseek_v2_lite_chat	0	0	3	0	0	0
mistralai_mathstral_7b_v0.1	0	0	4	0	0	0
qwen2-72b-instruct	0	0	2	0	0	0
qwen2-1.5b-instruct	0	0	4	0	0	0
qwen2-0.5b-instruct	0	0	5	0	0	0
qwen1.5-7b-chat	0	0	3	0	0	0
qwen1.5-72b-chat	0	0	2	0	0	0
qwen1.5-32b-chat	0	0	3	0	0	0
qwen1.5-14b-chat	0	0	3	0	0	0
qwen1.5-1.8b-chat	0	0	3	0	0	0
qwen1.5-0.5b-chat	0	0	5	0	0	0
mistralai_mistral_7b_instruct_v0.3	0	0	4	0	0	0
mistralai_mistral_7b_instruct_v0.1	0	0	4	0	0	0
mistralai_mistral_7b_instruct_v0.2	0	0	4	0	0	0
qwen2.5-coder-0.5b-instruct	0	0	5	0	0	0
qwen2.5-coder-1.5b-instruct	0	0	4	0	0	0
qwen2-7b-instruct	0	0	4	0	0	0
qwen2.5-coder-3b-instruct	0	0	4	0	0	0
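
The page does not define the SE columns, so the following is only a sanity-check sketch under stated assumptions. The tabulated values look consistent with two relationships: SE(A) matches a binomial standard error sqrt(p(1-p)/N) computed from pass1 with N = 30 questions (an assumption, based on AIME 2025 having 30 problems), and SE(A)^2 is approximately SE_x(A)^2 + SE_pred(A)^2. The helper names `binomial_se` and `combined_se` are hypothetical and are not taken from the project's code; the three rows are copied from the table above.

```python
import math

# Assumption: the dataset has 30 questions (AIME 2025 I and II combined).
N_QUESTIONS = 30


def binomial_se(pass1_pct, n=N_QUESTIONS):
    """Binomial standard error of an accuracy, in percentage points."""
    p = pass1_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


def combined_se(se_x, se_pred):
    """Combine two components in quadrature: sqrt(se_x^2 + se_pred^2)."""
    return math.hypot(se_x, se_pred)


# (model, pass1, SE(A), SE_x(A), SE_pred(A)) copied from the table.
rows = [
    ("deepseek_r1_distill_llama_70b", 26.7, 8.1, 6.6, 4.7),
    ("qwen3-8b", 21.1, 7.5, 6.1, 4.3),
    ("llama-3.2-3B-instruct", 3.33, 3.3, 3.3, 0.0),
]

for name, pass1, se, se_x, se_pred in rows:
    print(
        f"{name}: table SE(A)={se:.1f}, "
        f"binomial={binomial_se(pass1):.2f}, "
        f"quadrature={combined_se(se_x, se_pred):.2f}"
    )
```

If both assumptions hold, the two recomputed values for each row should agree with the tabulated SE(A) up to the rounding used in the table.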