math500_cot: results by model



SE predicted by accuracy

The typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
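As a rough sanity check (a sketch, not necessarily the computation behind this page): for a dataset of n = 500 questions, the simple binomial formula SE = sqrt(A(1 − A)/n) reproduces the SE(A) column in the table below to rounding, and the SE_x(A) and SE_pred(A) columns appear to combine in quadrature to give SE(A).

```python
import math

N_QUESTIONS = 500  # MATH-500 has 500 problems

def binomial_se(acc_pct, n=N_QUESTIONS):
    """SE of a mean accuracy in percentage points, treating each
    question as an independent Bernoulli trial."""
    p = acc_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# (pass1, SE(A), SE_x(A), SE_pred(A)) copied from two rows of the table
rows = [
    (85.3, 1.6, 1.3, 0.92),  # google_gemma_3_27b_it
    (52.3, 2.2, 1.9, 1.2),   # google_gemma_2_27b_it
]

for acc, se, se_x, se_pred in rows:
    # Binomial SE matches the tabulated SE(A) to rounding...
    print(f"{acc}%: binomial SE = {binomial_se(acc):.2f} (table: {se})")
    # ...and the two components combine roughly in quadrature.
    print(f"     sqrt(SE_x^2 + SE_pred^2) = {math.sqrt(se_x**2 + se_pred**2):.2f}")
```

Note this agreement is checked only empirically against the tabulated values; the exact decomposition used for SE_x(A) and SE_pred(A) is an assumption here.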

CDF of question-level accuracy

Results table by model

| model | pass1 (%) | win_rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| google_gemma_3_27b_it | 85.3 | 42.2 | 11 | 1.6 | 1.3 | 0.92 |
| deepseek_r1_distill_qwen_32b | 83.9 | 41.3 | 10 | 1.6 | 1.2 | 1.1 |
| qwen3-32b | 81.9 | 39.4 | 9 | 1.7 | 1.3 | 1.1 |
| deepseek_r1_distill_llama_70b | 81.1 | 39.4 | 10 | 1.7 | 1.3 | 1.2 |
| qwen3-14b | 81 | 38.7 | 11 | 1.8 | 1.4 | 1.1 |
| qwen2-math-72b-instruct | 80.4 | 38.1 | 7 | 1.8 | 1.5 | 1 |
| qwen3-8b | 79.2 | 37.3 | 11 | 1.8 | 1.5 | 1.1 |
| google_gemma_3_12b_it | 79 | 37.3 | 12 | 1.8 | 1.5 | 1 |
| deepseek_r1_distill_qwen_14b | 78.9 | 37.9 | 12 | 1.8 | 1.3 | 1.3 |
| qwen3-4b | 77.2 | 36 | 12 | 1.9 | 1.5 | 1.1 |
| deepseek_r1_distill_qwen_7b | 77.1 | 36.5 | 12 | 1.9 | 1.4 | 1.3 |
| qwen2.5-coder-32b-instruct | 76.1 | 35 | 10 | 1.9 | 1.5 | 1.2 |
| qwen2.5-coder-14b-instruct | 71.3 | 31.7 | 11 | 2 | 1.6 | 1.2 |
| deepseek_r1_distill_llama_8b | 70.4 | 32.7 | 12 | 2 | 1.5 | 1.4 |
| qwen2-math-1.5b-instruct | 68 | 29.7 | 1 | 2.1 | NaN | NaN |
| deepseek_r1_distill_qwen_1.5b | 67.6 | 30.1 | 12 | 2.1 | 1.5 | 1.4 |
| qwen3-1.7b | 66.7 | 29.1 | 12 | 2.1 | 1.6 | 1.3 |
| llama-3.1-70B-instruct | 65.6 | 28.7 | 12 | 2.1 | 2.1 | 0.21 |
| google_gemma_3_4b_it | 65.1 | 29.1 | 12 | 2.1 | 1.7 | 1.3 |
| qwen2-72b-instruct | 64.5 | 27.2 | 10 | 2.1 | 1.7 | 1.3 |
| qwen2.5-coder-7b-instruct | 61 | 25.7 | 12 | 2.2 | 1.6 | 1.5 |
| google_gemma_2_27b_it | 52.3 | 20.4 | 10 | 2.2 | 1.9 | 1.2 |
| qwen2-7b-instruct | 51.7 | 20.1 | 12 | 2.2 | 1.7 | 1.5 |
| mistralai_ministral_8b_instruct_2410 | 49.3 | 18.9 | 12 | 2.2 | 1.7 | 1.5 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 48.1 | 18.4 | 9 | 2.2 | 1.7 | 1.5 |
| llama-3.1-8B-instruct | 47.6 | 18 | 13 | 2.2 | 2 | 1.1 |
| mistralai_mathstral_7b_v0.1 | 47 | 17.7 | 12 | 2.2 | 1.7 | 1.5 |
| qwen2.5-coder-3b-instruct | 46.9 | 18.4 | 11 | 2.2 | 1.5 | 1.6 |
| google_gemma_2_9b_it | 46.4 | 16.9 | 11 | 2.2 | 1.9 | 1.2 |
| llama-3.2-3B-instruct | 44.1 | 16.6 | 18 | 2.2 | 1.9 | 1.1 |
| qwen1.5-72b-chat | 40.3 | 14.7 | 10 | 2.2 | 1.6 | 1.5 |
| qwen1.5-32b-chat | 38.9 | 14.1 | 9 | 2.2 | 1.6 | 1.5 |
| qwen3-0.6b | 33.7 | 12.2 | 13 | 2.1 | 1.5 | 1.5 |
| qwen2.5-coder-1.5b-instruct | 32.8 | 11.3 | 12 | 2.1 | 1.5 | 1.5 |
| qwen1.5-14b-chat | 30.5 | 10.3 | 11 | 2.1 | 1.5 | 1.4 |
| llama-3.2-1B-instruct | 27.9 | 9.67 | 11 | 2 | 2 | 0.23 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 25.2 | 8.44 | 11 | 1.9 | 1.3 | 1.4 |
| deepseek_v2_lite_chat | 22.5 | 7.25 | 9 | 1.9 | 1.2 | 1.4 |
| google_codegemma_1.1_7b_it | 20.8 | 6.55 | 12 | 1.8 | 1.3 | 1.3 |
| qwen1.5-7b-chat | 16.5 | 5.19 | 11 | 1.7 | 1.1 | 1.3 |
| google_gemma_3_1b_it | 14.5 | 5.93 | 12 | 1.6 | 1 | 1.2 |
| mistralai_mistral_7b_instruct_v0.3 | 12.9 | 3.86 | 12 | 1.5 | 0.95 | 1.2 |
| mistralai_mistral_7b_instruct_v0.2 | 10.1 | 2.98 | 12 | 1.3 | 0.82 | 1.1 |
| qwen2.5-coder-0.5b-instruct | 8.03 | 2.48 | 13 | 1.2 | 0.7 | 1 |
| qwen2-1.5b-instruct | 6.8 | 1.96 | 12 | 1.1 | 0.53 | 0.99 |
| mistralai_mistral_7b_instruct_v0.1 | 6.4 | 1.99 | 12 | 1.1 | 0.58 | 0.93 |
| google_gemma_7b_it | 5.77 | 1.68 | 12 | 1 | 0.68 | 0.79 |
| qwen2-0.5b-instruct | 2.91 | 0.985 | 13 | 0.75 | 0.31 | 0.69 |
| qwen1.5-1.8b-chat | 1.42 | 0.434 | 11 | 0.53 | 0.16 | 0.5 |
| qwen1.5-0.5b-chat | 0.8 | 0.307 | 12 | 0.4 | 0.11 | 0.38 |
| google_gemma_2b_it | 0.117 | 0.047 | 12 | 0.15 | 0 | 0.15 |