math500_cot: results by model



SE predicted by accuracy

The typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
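As a rough sanity check on the relationship between accuracy and standard error, one can use the simple binomial prediction: treating each of the 500 questions as an independent Bernoulli trial, the standard error of a pass rate is sqrt(p(1-p)/N). This is a minimal sketch (the function name and the independence assumption are ours, not from this page); it reproduces the ballpark of the SE(A) column, which peaks near 50% accuracy.

```python
import math

def predicted_se(accuracy_pct: float, n_questions: int = 500) -> float:
    """Binomial standard error of a pass rate, in percentage points.

    Assumes each question is an independent Bernoulli trial;
    n_questions=500 matches the MATH-500 split reported here.
    """
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

# A model at 86.5% pass@1:
print(round(predicted_se(86.5), 2))  # -> 1.53 (table reports SE(A) = 1.5)

# SE is largest for mid-range accuracy:
print(round(predicted_se(50.0), 2))  # -> 2.24 (table shows ~2.2 there)
```

The match is only approximate, since the table's SE columns also account for per-question variation across repeated samples.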

CDF of question-level accuracy

Results table by model

| model | pass@1 (%) | pass@count (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|---|
| google_gemma_3_27b_it | 86.5 | 95.6 | 43.3 | 8 | 1.5 | 1.2 | 0.9 |
| deepseek_r1_distill_qwen_32b | 84.8 | 98.2 | 42.1 | 1100 | 1.6 | 1.2 | 1.1 |
| qwen3-32b | 82.3 | 98 | 39.9 | 700 | 1.7 | 1.3 | 1.1 |
| qwen3-14b | 82.3 | 97.8 | 39.8 | 1100 | 1.7 | 1.4 | 1 |
| deepseek_r1_distill_llama_70b | 82.1 | 97.8 | 40.1 | 1100 | 1.7 | 1.3 | 1.2 |
| qwen2-math-72b-instruct | 81.1 | 94.6 | 38.7 | 35 | 1.7 | 1.5 | 0.98 |
| google_gemma_3_12b_it | 80.2 | 98.4 | 38.3 | 1100 | 1.8 | 1.5 | 1 |
| qwen3-8b | 79.7 | 98.2 | 37.8 | 1100 | 1.8 | 1.4 | 1.1 |
| qwen3-4b | 78.2 | 97.8 | 36.8 | 1100 | 1.8 | 1.5 | 1.1 |
| deepseek_r1_distill_qwen_7b | 78.1 | 98.2 | 37.2 | 1100 | 1.8 | 1.4 | 1.2 |
| qwen2.5-coder-32b-instruct | 77 | 97.2 | 35.7 | 1100 | 1.9 | 1.5 | 1.1 |
| qwen2.5-coder-14b-instruct | 72.6 | 98 | 32.7 | 1100 | 2 | 1.6 | 1.2 |
| deepseek_r1_distill_llama_8b | 70.5 | 95.2 | 32.8 | 71 | 2 | 1.5 | 1.4 |
| deepseek_r1_distill_qwen_1.5b | 68.7 | 98 | 30.7 | 1100 | 2.1 | 1.5 | 1.4 |
| qwen3-1.7b | 67.9 | 97.6 | 29.9 | 1100 | 2.1 | 1.6 | 1.3 |
| llama-3.1-70B-instruct | 66.4 | 88.6 | 29.1 | 1000 | 2.1 | 1.7 | 1.3 |
| google_gemma_3_4b_it | 65.7 | 97.2 | 29.4 | 1100 | 2.1 | 1.7 | 1.3 |
| qwen2-72b-instruct | 65.3 | 96.4 | 27.9 | 220 | 2.1 | 1.7 | 1.3 |
| qwen2.5-coder-7b-instruct | 62.4 | 96.8 | 26.5 | 1100 | 2.2 | 1.6 | 1.5 |
| google_gemma_2_27b_it | 53.1 | 93.8 | 20.8 | 1100 | 2.2 | 1.9 | 1.2 |
| qwen2-7b-instruct | 52.5 | 96.8 | 20.5 | 1100 | 2.2 | 1.7 | 1.5 |
| mistralai_ministral_8b_instruct_2410 | 49.3 | 97.4 | 19 | 1100 | 2.2 | 1.6 | 1.5 |
| mistralai_mathstral_7b_v0.1 | 48.5 | 97.8 | 18.6 | 1100 | 2.2 | 1.6 | 1.5 |
| llama-3.1-8B-instruct | 48.4 | 93.8 | 18.7 | 1100 | 2.2 | 1.7 | 1.5 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 47.8 | 91.8 | 18.4 | 100 | 2.2 | 1.6 | 1.5 |
| qwen2.5-coder-3b-instruct | 47.4 | 97.2 | 18.7 | 1100 | 2.2 | 1.5 | 1.7 |
| google_gemma_2_9b_it | 47.4 | 90.6 | 17.6 | 1100 | 2.2 | 1.9 | 1.2 |
| llama-3.2-3B-instruct | 44.5 | 91.8 | 16.7 | 1100 | 2.2 | 1.7 | 1.4 |
| qwen1.5-72b-chat | 40.6 | 93.4 | 14.9 | 390 | 2.2 | 1.6 | 1.5 |
| qwen1.5-32b-chat | 39.9 | 82.2 | 14.6 | 30 | 2.2 | 1.6 | 1.5 |
| qwen3-0.6b | 33.4 | 95 | 12.1 | 1100 | 2.1 | 1.4 | 1.5 |
| qwen2.5-coder-1.5b-instruct | 33.1 | 95.6 | 11.7 | 1100 | 2.1 | 1.4 | 1.6 |
| qwen1.5-14b-chat | 30.9 | 95.2 | 10.4 | 1100 | 2.1 | 1.5 | 1.5 |
| llama-3.2-1B-instruct | 26.6 | 87.8 | 8.92 | 1100 | 2 | 1.4 | 1.4 |
| google_codegemma_1.1_7b_it | 21.9 | 91.2 | 7.04 | 1100 | 1.9 | 1.3 | 1.3 |
| deepseek_v2_lite_chat | 21.4 | 93.8 | 6.98 | 1100 | 1.8 | 1.1 | 1.4 |
| qwen1.5-7b-chat | 17 | 92.2 | 5.27 | 1100 | 1.7 | 1.1 | 1.3 |
| google_gemma_3_1b_it | 14.5 | 86.6 | 6.01 | 1100 | 1.6 | 1 | 1.2 |
| mistralai_mistral_7b_instruct_v0.3 | 13.9 | 91.4 | 4.34 | 1100 | 1.5 | 0.97 | 1.2 |
| mistralai_mistral_7b_instruct_v0.2 | 10.4 | 86 | 3.22 | 920 | 1.4 | 0.79 | 1.1 |
| qwen2.5-coder-0.5b-instruct | 8.66 | 87.4 | 2.77 | 1100 | 1.3 | 0.68 | 1.1 |
| qwen2-1.5b-instruct | 7.18 | 89.4 | 2.17 | 1100 | 1.2 | 0.51 | 1 |
| mistralai_mistral_7b_instruct_v0.1 | 6.04 | 83.6 | 1.88 | 1100 | 1.1 | 0.48 | 0.95 |
| google_gemma_7b_it | 5.78 | 64.8 | 1.72 | 1100 | 1 | 0.68 | 0.79 |
| qwen2-0.5b-instruct | 3.34 | 83.2 | 1.13 | 1100 | 0.8 | 0.28 | 0.75 |
| qwen1.5-1.8b-chat | 1.53 | 73.8 | 0.539 | 1100 | 0.55 | 0.14 | 0.53 |
| qwen1.5-0.5b-chat | 0.92 | 64.8 | 0.386 | 890 | 0.43 | 0.089 | 0.42 |
| google_gemma_2b_it | 0.196 | 26.6 | 0.0683 | 1100 | 0.2 | 0.041 | 0.19 |
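Assuming the pass@count column reports pass@k evaluated at k equal to the per-question sample count, the standard unbiased estimator from the HumanEval paper (Chen et al., 2021) applies; this sketch is illustrative, and the function name is ours.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k samples, drawn without replacement from n total
    samples of which c are correct, is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# At k = n the estimator reduces to an indicator: did any sample pass?
print(pass_at_k(8, 0, 8))   # -> 0.0
print(pass_at_k(8, 1, 8))   # -> 1.0
print(pass_at_k(10, 5, 1))  # -> 0.5
```

This is consistent with pass@count staying high even for weak models: with ~1100 samples per question, a single lucky success on a question counts it as solved.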