aime2024_cot: by models

Figure: SE predicted by accuracy. The typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.
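Under a simple binomial model, the standard error of a model's mean accuracy can be predicted from the accuracy alone: SE = sqrt(a(1 - a) / n). Below is a minimal sketch of that prediction, plus the SE of the accuracy difference between two independent models; the exact estimator behind this plot is not specified on this page, so treat both formulas as assumptions.

```python
import math

def predicted_se(accuracy: float, n_questions: int = 30) -> float:
    """Binomial standard error of mean accuracy over n i.i.d. questions.

    n_questions=30 assumes the AIME 2024 question set; the estimator
    actually used for the plot is an assumption here.
    """
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)

def paired_se(acc_a: float, acc_b: float, n_questions: int = 30) -> float:
    """SE of the accuracy difference between two models, assuming their
    errors are independent (ignores question-level correlation)."""
    return math.hypot(predicted_se(acc_a, n_questions),
                      predicted_se(acc_b, n_questions))

# Example: 37.5% accuracy over 30 questions gives an SE of about 8.8
# points, matching SE(A) for the top row of the table below.
print(f"{100 * predicted_se(0.375):.1f}")  # 8.8
```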

Figure: CDF of question-level accuracy.
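The question-level CDF can be rebuilt from a vector of per-question pass rates (the fraction of sampled attempts that solve each question). A short sketch, where `pass_rates` is a hypothetical array standing in for values computed from the raw per-attempt results:

```python
import numpy as np

# Hypothetical input: pass_rates[i] is the fraction of attempts that
# solved question i (30 questions on AIME 2024); placeholder data here.
rng = np.random.default_rng(0)
pass_rates = rng.random(30)

# Empirical CDF: fraction of questions with accuracy <= x.
xs = np.sort(pass_rates)
cdf = np.arange(1, xs.size + 1) / xs.size
for x, p in zip(xs, cdf):
    print(f"P(question accuracy <= {x:.2f}) = {p:.2f}")
```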

Results table by model

| model | pass@1 (%) | win rate (%) | count | SE(A) (%) | SE_x(A) (%) | SE_pred(A) (%) |
|---|---:|---:|---:|---:|---:|---:|
| deepseek_r1_distill_llama_70b | 37.5 | 31.5 | 1.1e+03 | 8.8 | 6.7 | 5.7 |
| deepseek_r1_distill_qwen_32b | 36.6 | 30.7 | 1.1e+03 | 8.8 | 6.8 | 5.6 |
| deepseek_r1_distill_qwen_14b | 31.4 | 26 | 1.1e+03 | 8.5 | 6.3 | 5.7 |
| google_gemma_3_27b_it | 27.8 | 22.5 | 1e+03 | 8.2 | 6.1 | 5.4 |
| deepseek_r1_distill_qwen_7b | 27.4 | 22.4 | 1.1e+03 | 8.1 | 6.1 | 5.4 |
| qwen3-32b | 24.1 | 19.1 | 1.1e+03 | 7.8 | 6.1 | 4.8 |
| qwen3-14b | 23.7 | 18.9 | 1.1e+03 | 7.8 | 6.1 | 4.8 |
| qwen3-8b | 21.8 | 17.2 | 1.1e+03 | 7.5 | 5.8 | 4.8 |
| google_gemma_3_12b_it | 20.5 | 15.9 | 1.1e+03 | 7.4 | 5.8 | 4.6 |
| deepseek_r1_distill_llama_8b | 20.4 | 16.4 | 1.1e+03 | 7.4 | 5.1 | 5.3 |
| llama-3.1-70B-instruct | 19.2 | 15.5 | 1.1e+03 | 7.2 | 5.3 | 4.9 |
| qwen2-math-72b-instruct | 16.7 | 12.9 | 1.6e+02 | 6.8 | 5 | 4.6 |
| qwen3-4b | 16.5 | 12.3 | 1.1e+03 | 6.8 | 5.2 | 4.3 |
| deepseek_r1_distill_qwen_1.5b | 14.3 | 11 | 1.1e+03 | 6.4 | 4.4 | 4.7 |
| qwen2.5-coder-32b-instruct | 13.3 | 9.78 | 1.1e+03 | 6.2 | 4.7 | 4 |
| qwen3-1.7b | 9.26 | 6.78 | 1.1e+03 | 5.3 | 3.8 | 3.7 |
| qwen2.5-coder-14b-instruct | 8.04 | 5.88 | 1.1e+03 | 5 | 3 | 3.9 |
| google_gemma_3_4b_it | 7.02 | 5.12 | 1.1e+03 | 4.7 | 2.7 | 3.8 |
| llama-3.2-3B-instruct | 5.76 | 4.45 | 1.1e+03 | 4.3 | 2 | 3.7 |
| qwen2-72b-instruct | 5.09 | 3.56 | 1.1e+03 | 4 | 2.4 | 3.2 |
| qwen2.5-coder-7b-instruct | 5.02 | 3.56 | 1.1e+03 | 4 | 2.1 | 3.4 |
| llama-3.1-8B-instruct | 4.84 | 3.58 | 1.1e+03 | 3.9 | 2 | 3.4 |
| google_gemma_2_27b_it | 4.12 | 2.71 | 1.1e+03 | 3.6 | 2.5 | 2.7 |
| mistralai_mathstral_7b_v0.1 | 2.16 | 1.52 | 1.1e+03 | 2.7 | 1.1 | 2.4 |
| qwen2-7b-instruct | 1.8 | 1.28 | 1.1e+03 | 2.4 | 0.84 | 2.3 |
| mistralai_ministral_8b_instruct_2410 | 1.65 | 1.22 | 1.1e+03 | 2.3 | 0.94 | 2.1 |
| qwen2.5-coder-3b-instruct | 1.62 | 1.22 | 1.1e+03 | 2.3 | 0.68 | 2.2 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 1.49 | 1.03 | 1.1e+03 | 2.2 | 0.71 | 2.1 |
| qwen1.5-32b-chat | 1.41 | 1.02 | 1.1e+03 | 2.2 | 0.64 | 2.1 |
| google_gemma_2_9b_it | 1.39 | 0.927 | 1.1e+03 | 2.1 | 1.1 | 1.9 |
| llama-3.2-1B-instruct | 1.36 | 1.11 | 1.1e+03 | 2.1 | 0.61 | 2 |
| qwen3-0.6b | 1.2 | 0.864 | 1.1e+03 | 2 | 0.85 | 1.8 |
| qwen1.5-72b-chat | 1.05 | 0.823 | 1.1e+03 | 1.9 | 0.42 | 1.8 |
| qwen1.5-14b-chat | 0.861 | 0.734 | 1.1e+03 | 1.7 | 0.38 | 1.6 |
| qwen1.5-7b-chat | 0.785 | 0.572 | 1.1e+03 | 1.6 | 0.57 | 1.5 |
| qwen2.5-coder-1.5b-instruct | 0.733 | 0.582 | 1.1e+03 | 1.6 | 0.33 | 1.5 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 0.4 | 0.309 | 1.1e+03 | 1.2 | 0.13 | 1.1 |
| google_codegemma_1.1_7b_it | 0.382 | 0.321 | 1.1e+03 | 1.1 | 0.13 | 1.1 |
| deepseek_v2_lite_chat | 0.376 | 0.348 | 1.1e+03 | 1.1 | 0.19 | 1.1 |
| google_gemma_3_1b_it | 0.37 | 0.31 | 1.1e+03 | 1.1 | 0.16 | 1.1 |
| mistralai_mistral_7b_instruct_v0.3 | 0.206 | 0.169 | 1.1e+03 | 0.83 | 0.074 | 0.82 |
| mistralai_mistral_7b_instruct_v0.2 | 0.161 | 0.138 | 1.1e+03 | 0.73 | 0.061 | 0.73 |
| qwen2.5-coder-0.5b-instruct | 0.155 | 0.121 | 1.1e+03 | 0.72 | 0.079 | 0.71 |
| mistralai_mistral_7b_instruct_v0.1 | 0.1 | 0.0953 | 1.1e+03 | 0.58 | 0.055 | 0.57 |
| qwen2-1.5b-instruct | 0.0667 | 0.0584 | 1.1e+03 | 0.47 | 0.017 | 0.47 |
| google_gemma_7b_it | 0.0545 | 0.0492 | 1.1e+03 | 0.43 | 0.023 | 0.43 |
| qwen2-0.5b-instruct | 0.0364 | 0.0302 | 1.1e+03 | 0.35 | 0.01 | 0.35 |
| qwen1.5-0.5b-chat | 0.0273 | 0.0227 | 1.1e+03 | 0.3 | 0.01 | 0.3 |
| qwen1.5-1.8b-chat | 0.0182 | 0.0153 | 1.1e+03 | 0.25 | 0.0027 | 0.25 |
| google_gemma_2b_it | 0 | 0 | 1.1e+03 | 0 | 0 | 0 |
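As a sanity check, the SE(A) column is consistent with the binomial formula from the sketch above evaluated at n = 30 questions. A quick verification on a few rows transcribed from the table:

```python
import math

# (model, pass@1 %, reported SE(A) %) transcribed from the table above.
rows = [
    ("deepseek_r1_distill_llama_70b", 37.5, 8.8),
    ("qwen3-8b", 21.8, 7.5),
    ("llama-3.1-8B-instruct", 4.84, 3.9),
]

for name, pass1, se_reported in rows:
    a = pass1 / 100.0
    se_binomial = 100.0 * math.sqrt(a * (1.0 - a) / 30.0)  # 30 AIME questions
    print(f"{name}: reported {se_reported}, binomial {se_binomial:.1f}")
```

The agreement suggests SE(A) is the plain binomial standard error over the 30 questions; the precise definitions of SE_x(A) and SE_pred(A) are not given on this page.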