math500_cot: by models


SE predicted by accuracy

The plot shows the typical standard error between pairs of models on this dataset as a function of absolute accuracy.
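
As a rough illustration of how a standard error depends on accuracy, the sketch below computes the binomial standard error of a single model's accuracy estimate. It assumes 500 independent questions (the size of MATH-500) and one sample per question. The function name `binomial_se` is hypothetical, and the page does not state that this is the formula it uses. Under those assumptions the values land close to the SE(A) column of the results table.

```python
import math


def binomial_se(accuracy_pct: float, n_questions: int = 500) -> float:
    """Binomial standard error, in percentage points, of an accuracy estimate.

    Assumption: each of the n_questions is an independent Bernoulli trial
    with success probability accuracy_pct / 100.
    """
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


# The standard error peaks at 50% accuracy and shrinks toward 0% and 100%.
for acc in (85.3, 50.0, 10.1):
    print(f"accuracy={acc:5.1f}%  SE={binomial_se(acc):.2f}")
```

The curve is symmetric around 50%, which is consistent with SE(A) peaking near 2.2 for mid-table models and falling off for both the strongest and the weakest models.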

CDF of question-level accuracy

Results table by model

model pass1 pass@count win_rate count SE(A) SE_x(A) SE_pred(A)
google_gemma_3_27b_it 85.3 94.2 42.2 11 1.6 1.3 0.92
deepseek_r1_distill_qwen_32b 83.9 95.2 41.3 10 1.6 1.2 1.1
qwen3-32b 81.9 92.8 39.4 9 1.7 1.3 1.1
deepseek_r1_distill_llama_70b 81.1 94.2 39.4 10 1.7 1.3 1.2
qwen3-14b 81 93 38.7 11 1.8 1.4 1.1
qwen2-math-72b-instruct 80.4 91.4 38.1 7 1.8 1.5 1
qwen3-8b 79.2 93.4 37.3 11 1.8 1.5 1.1
google_gemma_3_12b_it 79 92 37.3 12 1.8 1.5 1
deepseek_r1_distill_qwen_14b 78.9 94.2 37.9 12 1.8 1.3 1.3
qwen3-4b 77.2 92.2 36 12 1.9 1.5 1.1
deepseek_r1_distill_qwen_7b 77.1 93 36.5 12 1.9 1.4 1.3
qwen2.5-coder-32b-instruct 76.1 90.2 35 10 1.9 1.5 1.2
qwen2.5-coder-14b-instruct 71.3 89 31.7 11 2 1.6 1.2
deepseek_r1_distill_llama_8b 70.4 91.2 32.7 12 2 1.5 1.4
qwen2-math-1.5b-instruct 68 68 29.7 1 2.1 NaN NaN
deepseek_r1_distill_qwen_1.5b 67.6 89.6 30.1 12 2.1 1.5 1.4
qwen3-1.7b 66.7 87 29.1 12 2.1 1.6 1.3
llama-3.1-70B-instruct 65.6 65.8 28.7 12 2.1 2.1 0.21
google_gemma_3_4b_it 65.1 84.8 29.1 12 2.1 1.7 1.3
qwen2-72b-instruct 64.5 83.6 27.2 10 2.1 1.7 1.3
qwen2.5-coder-7b-instruct 61 86.8 25.7 12 2.2 1.6 1.5
google_gemma_2_27b_it 52.3 74.2 20.4 10 2.2 1.9 1.2
qwen2-7b-instruct 51.7 80.2 20.1 12 2.2 1.7 1.5
mistralai_ministral_8b_instruct_2410 49.3 79.2 18.9 12 2.2 1.7 1.5
mistralai_mixtral_8x22b_instruct_v0.1 48.1 77.8 18.4 9 2.2 1.7 1.5
llama-3.1-8B-instruct 47.6 58.6 18 13 2.2 2 1.1
mistralai_mathstral_7b_v0.1 47 76.6 17.7 12 2.2 1.7 1.5
qwen2.5-coder-3b-instruct 46.9 80 18.4 11 2.2 1.5 1.6
google_gemma_2_9b_it 46.4 67.8 16.9 11 2.2 1.9 1.2
llama-3.2-3B-instruct 44.1 55.6 16.6 18 2.2 1.9 1.1
qwen1.5-72b-chat 40.3 73 14.7 10 2.2 1.6 1.5
qwen1.5-32b-chat 38.9 70.4 14.1 9 2.2 1.6 1.5
qwen3-0.6b 33.7 71.4 12.2 13 2.1 1.5 1.5
qwen2.5-coder-1.5b-instruct 32.8 66.4 11.3 12 2.1 1.5 1.5
qwen1.5-14b-chat 30.5 63.6 10.3 11 2.1 1.5 1.4
llama-3.2-1B-instruct 27.9 28.2 9.67 11 2 2 0.23
mistralai_mixtral_8x7b_instruct_v0.1 25.2 61.2 8.44 11 1.9 1.3 1.4
deepseek_v2_lite_chat 22.5 55.8 7.25 9 1.9 1.2 1.4
google_codegemma_1.1_7b_it 20.8 52 6.55 12 1.8 1.3 1.3
qwen1.5-7b-chat 16.5 45.4 5.19 11 1.7 1.1 1.3
google_gemma_3_1b_it 14.5 43.2 5.93 12 1.6 1 1.2
mistralai_mistral_7b_instruct_v0.3 12.9 41 3.86 12 1.5 0.95 1.2
mistralai_mistral_7b_instruct_v0.2 10.1 37.6 2.98 12 1.3 0.82 1.1
qwen2.5-coder-0.5b-instruct 8.03 32.6 2.48 13 1.2 0.7 1
qwen2-1.5b-instruct 6.8 31.6 1.96 12 1.1 0.53 0.99
mistralai_mistral_7b_instruct_v0.1 6.4 26.8 1.99 12 1.1 0.58 0.93
google_gemma_7b_it 5.77 21.4 1.68 12 1 0.68 0.79
qwen2-0.5b-instruct 2.91 19.4 0.985 13 0.75 0.31 0.69
qwen1.5-1.8b-chat 1.42 10.6 0.434 11 0.53 0.16 0.5
qwen1.5-0.5b-chat 0.8 6.6 0.307 12 0.4 0.11 0.38
google_gemma_2b_it 0.117 1.4 0.047 12 0.15 0 0.15
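
The three SE columns appear to be related by a variance decomposition: in the rows above, SE(A)² is approximately SE_x(A)² + SE_pred(A)². This is an observation from the tabulated values, not a definition given on the page. Reading SE_x(A) as the question-sampling component and SE_pred(A) as the prediction-noise component is likewise an assumption. A minimal check on a few rows copied from the table, with the hypothetical helper `combined_se`:

```python
import math

# (model, SE(A), SE_x(A), SE_pred(A)), copied from the results table.
rows = [
    ("google_gemma_3_27b_it", 1.6, 1.3, 0.92),
    ("llama-3.1-70B-instruct", 2.1, 2.1, 0.21),
    ("qwen2.5-coder-3b-instruct", 2.2, 1.5, 1.6),
    ("qwen1.5-0.5b-chat", 0.4, 0.11, 0.38),
]


def combined_se(se_x: float, se_pred: float) -> float:
    """Combine two standard errors in quadrature.

    Assumption: the two components are independent, so their variances add.
    """
    return math.hypot(se_x, se_pred)


for model, se_total, se_x, se_pred in rows:
    recombined = combined_se(se_x, se_pred)
    print(f"{model:28s} reported={se_total:.2f} recombined={recombined:.2f}")
```

If this reading is right, models with a small SE_pred(A), such as llama-3.1-70B-instruct, give nearly the same answers on every sample, so almost all of their uncertainty comes from which questions are in the dataset.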