aime2025_cot: results by model

Figure: SE predicted by accuracy — the typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
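For intuition, the simplest accuracy-based predictor of the standard error is the binomial SE of mean accuracy over the 30 AIME 2025 problems; the SE(A) column in the table below is numerically consistent with this formula. A minimal sketch in Python (values in percentage points; N = 30 is the AIME 2025 question count):

```python
import math

N_QUESTIONS = 30  # AIME 2025: 15 problems per exam, two exams


def binomial_se(accuracy_pct: float, n_questions: int = N_QUESTIONS) -> float:
    """Binomial standard error of mean accuracy, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


# Spot check against the first table row (deepseek_r1_distill_qwen_32b):
# binomial_se(26.6) -> 8.07, matching the reported SE(A) of 8.1.
print(round(binomial_se(26.6), 2))
```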

Figure: CDF of question-level accuracy.
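The question-level CDF is the empirical distribution of per-question pass rates. A minimal sketch of how such a curve is computed, with made-up per-question counts (K = 36 completions per question is a guess consistent with count ≈ 1.1e+03 over 30 questions; the real counts come from the eval logs behind this page):

```python
import numpy as np

# Hypothetical inputs: pass counts over K sampled completions for each
# of the 30 AIME 2025 problems (placeholders, not the real data).
rng = np.random.default_rng(0)
K = 36
passes = rng.binomial(K, rng.beta(0.5, 2.0, size=30))
question_acc = passes / K  # question-level accuracy in [0, 1]

# Empirical CDF: fraction of questions with accuracy <= t.
xs = np.sort(question_acc)
cdf = np.arange(1, xs.size + 1) / xs.size
for t, f in zip(xs[::10], cdf[::10]):
    print(f"P(question accuracy <= {t:.2f}) = {f:.2f}")
```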

Results table by model

| model | pass@1 (%) | win rate (%) | count (samples) | SE(A) (%) | SE_x(A) (%) | SE_pred(A) (%) |
|---|---|---|---|---|---|---|
| deepseek_r1_distill_qwen_32b | 26.6 | 22 | 1.1e+03 | 8.1 | 6.8 | 4.3 |
| google_gemma_3_27b_it | 25.1 | 20.5 | 9.5e+02 | 7.9 | 6.7 | 4.2 |
| deepseek_r1_distill_llama_70b | 25 | 20.6 | 1.1e+03 | 7.9 | 6.7 | 4.2 |
| deepseek_r1_distill_qwen_14b | 24 | 19.5 | 1.1e+03 | 7.8 | 6.7 | 4 |
| deepseek_r1_distill_qwen_7b | 22.5 | 18.1 | 1.1e+03 | 7.6 | 6.6 | 3.8 |
| qwen3-14b | 20 | 16.1 | 1.1e+03 | 7.3 | 5.8 | 4.5 |
| qwen3-8b | 18.6 | 14.8 | 1.1e+03 | 7.1 | 5.5 | 4.5 |
| qwen3-32b | 18.5 | 15 | 1.1e+03 | 7.1 | 5.3 | 4.7 |
| deepseek_r1_distill_llama_8b | 18.3 | 14.5 | 1.1e+03 | 7.1 | 5.8 | 4 |
| google_gemma_3_12b_it | 17.3 | 13.5 | 1.1e+03 | 6.9 | 6 | 3.3 |
| qwen3-4b | 17.1 | 13.6 | 1.1e+03 | 6.9 | 5.3 | 4.4 |
| deepseek_r1_distill_qwen_1.5b | 15 | 11.7 | 1.1e+03 | 6.5 | 5 | 4.2 |
| qwen2-math-72b-instruct | 11.5 | 9.23 | 1.4e+02 | 5.8 | 3.9 | 4.4 |
| qwen2.5-coder-32b-instruct | 11.4 | 8.99 | 1.1e+03 | 5.8 | 4.1 | 4.1 |
| google_gemma_3_4b_it | 10.8 | 8.25 | 1.1e+03 | 5.7 | 4.5 | 3.5 |
| qwen3-1.7b | 8.55 | 6.75 | 1.1e+03 | 5.1 | 3.3 | 3.9 |
| qwen2.5-coder-14b-instruct | 7.95 | 6.12 | 1.1e+03 | 4.9 | 2.9 | 4 |
| qwen2.5-coder-7b-instruct | 3.14 | 2.42 | 1.1e+03 | 3.2 | 1.1 | 3 |
| qwen2-72b-instruct | 2.62 | 2.13 | 1.1e+03 | 2.9 | 0.89 | 2.8 |
| llama-3.1-70B-instruct | 2.5 | 1.99 | 1.1e+03 | 2.8 | 1.1 | 2.6 |
| qwen2.5-coder-3b-instruct | 1.39 | 1.14 | 1.1e+03 | 2.1 | 0.53 | 2.1 |
| mistralai_ministral_8b_instruct_2410 | 1.17 | 0.955 | 1.1e+03 | 2 | 0.45 | 1.9 |
| qwen3-0.6b | 1.08 | 0.87 | 1.1e+03 | 1.9 | 0.45 | 1.8 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 1.02 | 0.861 | 1.1e+03 | 1.8 | 0.37 | 1.8 |
| google_gemma_2_27b_it | 0.836 | 0.682 | 1.1e+03 | 1.7 | 0.41 | 1.6 |
| llama-3.1-8B-instruct | 0.806 | 0.697 | 1.1e+03 | 1.6 | 0.24 | 1.6 |
| llama-3.2-3B-instruct | 0.761 | 0.642 | 1.1e+03 | 1.6 | 0.35 | 1.5 |
| mistralai_mathstral_7b_v0.1 | 0.676 | 0.585 | 1.1e+03 | 1.5 | 0.19 | 1.5 |
| qwen2-7b-instruct | 0.552 | 0.465 | 1.1e+03 | 1.4 | 0.22 | 1.3 |
| qwen1.5-72b-chat | 0.418 | 0.381 | 1.1e+03 | 1.2 | 0.11 | 1.2 |
| qwen1.5-32b-chat | 0.409 | 0.362 | 1.1e+03 | 1.2 | 0.12 | 1.2 |
| google_gemma_2_9b_it | 0.306 | 0.245 | 1.1e+03 | 1 | 0.18 | 0.99 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 0.306 | 0.278 | 1.1e+03 | 1 | 0.092 | 1 |
| google_gemma_3_1b_it | 0.276 | 0.207 | 1.1e+03 | 0.96 | 0.14 | 0.95 |
| qwen2.5-coder-1.5b-instruct | 0.255 | 0.229 | 1.1e+03 | 0.92 | 0.067 | 0.92 |
| qwen1.5-14b-chat | 0.206 | 0.185 | 1.1e+03 | 0.83 | 0.059 | 0.83 |
| deepseek_v2_lite_chat | 0.194 | 0.175 | 1.1e+03 | 0.8 | 0.042 | 0.8 |
| google_codegemma_1.1_7b_it | 0.161 | 0.151 | 1.1e+03 | 0.73 | 0.061 | 0.73 |
| qwen2.5-coder-0.5b-instruct | 0.161 | 0.153 | 1.1e+03 | 0.73 | 0.085 | 0.73 |
| mistralai_mistral_7b_instruct_v0.3 | 0.155 | 0.13 | 1.1e+03 | 0.72 | 0.063 | 0.71 |
| qwen1.5-7b-chat | 0.13 | 0.115 | 1.1e+03 | 0.66 | 0.032 | 0.66 |
| google_gemma_7b_it | 0.124 | 0.104 | 1.1e+03 | 0.64 | 0.042 | 0.64 |
| llama-3.2-1B-instruct | 0.124 | 0.112 | 1.1e+03 | 0.64 | 0.026 | 0.64 |
| mistralai_mistral_7b_instruct_v0.2 | 0.097 | 0.0913 | 1.1e+03 | 0.57 | 0.039 | 0.57 |
| qwen1.5-0.5b-chat | 0.0727 | 0.0665 | 1.1e+03 | 0.49 | 0.032 | 0.49 |
| qwen2-0.5b-instruct | 0.0545 | 0.0526 | 1.1e+03 | 0.43 | 0.017 | 0.43 |
| qwen1.5-1.8b-chat | 0.0455 | 0.0443 | 1.1e+03 | 0.39 | 0.016 | 0.39 |
| qwen2-1.5b-instruct | 0.0333 | 0.0292 | 1.1e+03 | 0.33 | 0.006 | 0.33 |
| mistralai_mistral_7b_instruct_v0.1 | 0.0303 | 0.0293 | 1.1e+03 | 0.32 | 0.0099 | 0.32 |
| google_gemma_2b_it | 0.00909 | 0.00806 | 1.1e+03 | 0.17 | 0 | 0.17 |
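A pattern worth noting in the table: the three SE columns combine in quadrature, SE(A)² ≈ SE_x(A)² + SE_pred(A)², so SE_x(A) and SE_pred(A) read as orthogonal components of the total standard error. A quick numerical check against a few rows (the decomposition is inferred from the table values, not taken from the paper):

```python
import math

# (SE(A), SE_x(A), SE_pred(A)) triples copied from the table above.
rows = {
    "deepseek_r1_distill_qwen_32b": (8.1, 6.8, 4.3),
    "qwen3-14b": (7.3, 5.8, 4.5),
    "qwen2.5-coder-7b-instruct": (3.2, 1.1, 3.0),
    "google_gemma_2b_it": (0.17, 0.0, 0.17),
}

for model, (se, se_x, se_pred) in rows.items():
    # Recombine the two components in quadrature and compare with SE(A).
    recombined = math.hypot(se_x, se_pred)
    print(f"{model}: SE(A)={se}, sqrt(SE_x^2 + SE_pred^2)={recombined:.2f}")
```

Running this reproduces the reported SE(A) to within rounding (8.05 vs 8.1, 7.34 vs 7.3, 3.20 vs 3.2, 0.17 vs 0.17).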