math500_cot: by model



[Figure: SE predicted by accuracy — typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.]
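For a dataset of N questions, the standard error of a single model's accuracy is the usual binomial sqrt(A(1-A)/N), which at A = 50% on 500 questions gives about 2.2 points — matching the SE(A) column in the table below. A minimal sketch of the curve in the figure, assuming MATH-500's 500 questions and, for the pairwise case, independent errors between the two models (the plotted values may instead account for question-level correlation):

```python
import math

N = 500  # assumption: MATH-500 has 500 questions

def se_accuracy(acc: float, n: int = N) -> float:
    """Binomial standard error of a single model's accuracy."""
    return math.sqrt(acc * (1 - acc) / n)

def se_pairwise(acc: float, n: int = N) -> float:
    """Approximate SE of the accuracy gap between two models at similar
    accuracy, assuming independent errors (ignores question-level correlation)."""
    return math.sqrt(2) * se_accuracy(acc, n)

for acc in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"A={acc:.0%}  SE(A)={100*se_accuracy(acc):.2f}pp  pairwise≈{100*se_pairwise(acc):.2f}pp")
```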

[Figure: CDF of question-level accuracy.]
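The plotted CDF can be recovered from the raw grades by averaging over samples within each question and sorting. A minimal sketch, assuming grades are stored as a question × sample boolean matrix (the array name and shapes here are illustrative assumptions):

```python
import numpy as np

def question_level_cdf(correct: np.ndarray):
    """correct: bool array (n_questions, n_samples); True means the sampled
    completion was graded correct. Returns the sorted per-question
    accuracies and the matching empirical CDF values."""
    per_question = correct.mean(axis=1)        # accuracy of each question
    xs = np.sort(per_question)
    cdf = np.arange(1, xs.size + 1) / xs.size  # empirical CDF
    return xs, cdf

# Hypothetical usage with random stand-in grades:
rng = np.random.default_rng(0)
fake_grades = rng.random((500, 4)) < 0.7
xs, cdf = question_level_cdf(fake_grades)
```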

Results table by model (accuracy, win-rate, and SE columns are in percentage points; a computation sketch for these metrics follows the table):

| model | pass@1 | pass@count | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---:|---:|---:|---:|---:|---:|---:|
| google_gemma_3_27b_it | 85.5 | 92.8 | 44.9 | 3 | 1.6 | 1.2 | 0.99 |
| qwen3-32b | 81.5 | 89.4 | 41.6 | 2 | 1.7 | 1.2 | 1.3 |
| deepseek_r1_distill_qwen_32b | 81.1 | 89 | 41.8 | 2 | 1.8 | 1.2 | 1.3 |
| qwen3-14b | 80.7 | 89.4 | 41 | 3 | 1.8 | 1.4 | 1.1 |
| deepseek_r1_distill_llama_70b | 78.9 | 87.2 | 40 | 2 | 1.8 | 1.3 | 1.3 |
| google_gemma_3_12b_it | 78.3 | 87.4 | 39.4 | 3 | 1.8 | 1.5 | 1.1 |
| qwen3-8b | 78.3 | 86.6 | 39.2 | 3 | 1.8 | 1.5 | 1.1 |
| qwen3-4b | 77.1 | 88 | 38.3 | 4 | 1.9 | 1.5 | 1.2 |
| deepseek_r1_distill_qwen_14b | 76.3 | 89 | 38.5 | 3 | 1.9 | 1.4 | 1.3 |
| qwen2.5-coder-32b-instruct | 74.4 | 82.2 | 36.3 | 2 | 2 | 1.5 | 1.2 |
| deepseek_r1_distill_qwen_7b | 72.7 | 87.6 | 36 | 3 | 2 | 1.4 | 1.4 |
| qwen2-math-7b-instruct | 70.7 | 79.2 | 33.6 | 2 | 2 | 1.6 | 1.3 |
| qwen2.5-coder-14b-instruct | 67.2 | 80.4 | 31.2 | 3 | 2.1 | 1.6 | 1.4 |
| deepseek_r1_distill_llama_8b | 67.1 | 84.8 | 32.7 | 4 | 2.1 | 1.5 | 1.5 |
| qwen3-1.7b | 65.4 | 79.6 | 30.4 | 4 | 2.1 | 1.7 | 1.3 |
| google_gemma_3_4b_it | 65 | 79.6 | 31 | 4 | 2.1 | 1.7 | 1.3 |
| deepseek_r1_distill_qwen_1.5b | 64.6 | 82.6 | 30.3 | 4 | 2.1 | 1.5 | 1.5 |
| llama-3.1-70B-instruct | 63.4 | 63.6 | 29.4 | 4 | 2.2 | 2.1 | 0.2 |
| qwen2-72b-instruct | 63 | 72.2 | 28.7 | 2 | 2.2 | 1.7 | 1.4 |
| qwen2-math-1.5b-instruct | 61.5 | 71.8 | 27.7 | 2 | 2.2 | 1.6 | 1.4 |
| qwen2.5-coder-7b-instruct | 53.7 | 71.2 | 23.4 | 3 | 2.2 | 1.6 | 1.5 |
| google_gemma_2_27b_it | 52.8 | 52.8 | 22.1 | 1 | 2.2 | NaN | NaN |
| qwen2-7b-instruct | 45.7 | 64.8 | 18.5 | 3 | 2.2 | 1.6 | 1.6 |
| llama-3.1-8B-instruct | 45.1 | 45.4 | 18.7 | 6 | 2.2 | 2.2 | 0.25 |
| google_gemma_2_9b_it | 44.9 | 56.4 | 17.8 | 3 | 2.2 | 1.9 | 1.2 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 43.8 | 57.8 | 17.9 | 2 | 2.2 | 1.5 | 1.7 |
| qwen2.5-coder-3b-instruct | 40.8 | 65.2 | 16.9 | 4 | 2.2 | 1.4 | 1.7 |
| mistralai_mathstral_7b_v0.1 | 39.8 | 60.4 | 16.1 | 3 | 2.2 | 1.4 | 1.7 |
| mistralai_ministral_8b_instruct_2410 | 39.8 | 58.2 | 15.8 | 3 | 2.2 | 1.5 | 1.6 |
| llama-3.2-3B-instruct | 38.3 | 38.8 | 15.1 | 10 | 2.2 | 2.2 | 0.27 |
| qwen1.5-72b-chat | 37.9 | 48.4 | 14.7 | 2 | 2.2 | 1.6 | 1.4 |
| qwen1.5-32b-chat | 37.1 | 49.8 | 14.6 | 2 | 2.2 | 1.5 | 1.6 |
| qwen3-0.6b | 31.6 | 50.8 | 12.3 | 3 | 2.1 | 1.4 | 1.6 |
| qwen1.5-14b-chat | 28.3 | 44.4 | 10.1 | 3 | 2 | 1.5 | 1.4 |
| qwen2.5-coder-1.5b-instruct | 26.2 | 49.6 | 9.64 | 4 | 2 | 1.2 | 1.5 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 23.3 | 40.4 | 8.16 | 3 | 1.9 | 1.2 | 1.4 |
| llama-3.2-1B-instruct | 20.5 | 20.6 | 7.33 | 11 | 1.8 | 1.8 | 0.16 |
| deepseek_v2_lite_chat | 18.9 | 27.2 | 6.46 | 2 | 1.8 | 1.2 | 1.3 |
| google_codegemma_1.1_7b_it | 18.6 | 34.8 | 6.2 | 4 | 1.7 | 1.2 | 1.3 |
| qwen1.5-7b-chat | 15.9 | 28.6 | 5.26 | 3 | 1.6 | 1.1 | 1.2 |
| google_gemma_3_1b_it | 13.6 | 30.2 | 5.92 | 4 | 1.5 | 0.91 | 1.2 |
| mistralai_mistral_7b_instruct_v0.3 | 10.5 | 20.6 | 3.39 | 3 | 1.4 | 0.81 | 1.1 |
| mistralai_mistral_7b_instruct_v0.2 | 9 | 19 | 3.06 | 3 | 1.3 | 0.71 | 1.1 |
| qwen2.5-coder-0.5b-instruct | 6 | 18 | 2.38 | 4 | 1.1 | 0.44 | 0.97 |
| google_gemma_7b_it | 5.75 | 13.6 | 1.86 | 4 | 1 | 0.64 | 0.82 |
| qwen2-1.5b-instruct | 4.65 | 13.2 | 1.54 | 4 | 0.94 | 0.42 | 0.84 |
| mistralai_mistral_7b_instruct_v0.1 | 4.13 | 9.8 | 1.42 | 3 | 0.89 | 0.39 | 0.8 |
| qwen2-0.5b-instruct | 2.05 | 6.8 | 0.756 | 4 | 0.63 | 0.21 | 0.6 |
| qwen1.5-1.8b-chat | 0.867 | 2.2 | 0.293 | 3 | 0.41 | 0.16 | 0.38 |
| google_gemma_2b_it | 0.35 | 1.4 | 0.104 | 4 | 0.26 | 0 | 0.26 |
| qwen1.5-0.5b-chat | 0.3 | 1.2 | 0.145 | 4 | 0.24 | 0 | 0.24 |
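The column names suggest the following readings: pass@1 is mean accuracy over all sampled completions, pass@count is pass@k with k equal to count (count apparently being the number of completions sampled per question), and SE(A) is the question-clustered standard error of accuracy. Up to rounding, the columns appear to satisfy SE(A)² ≈ SE_x(A)² + SE_pred(A)², consistent with splitting the variance into a question-level and a prediction (sampling) component; win_rate's exact definition isn't given on this page and isn't reproduced below. A minimal sketch under those assumed readings, using the standard unbiased pass@k estimator (Chen et al., 2021):

```python
from math import comb, sqrt

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples drawn from n (with c correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def summarize(correct: np.ndarray):
    """correct: bool array (n_questions, n_samples). Returns pass@1,
    pass@count, and the question-clustered SE of accuracy, in percent."""
    n_q, n_s = correct.shape
    p = correct.mean(axis=1)                 # per-question accuracy
    pass1 = p.mean()                         # mean accuracy over all samples
    pass_count = correct.any(axis=1).mean()  # pass@k with k = count = n_s
    se = p.std(ddof=1) / sqrt(n_q)           # SE(A), clustering by question
    return 100 * pass1, 100 * pass_count, 100 * se
```

Note that with count = 1 the clustered SE reduces to the plain binomial standard error, which is why the single-sample row (google_gemma_2_27b_it) reports SE(A) but NaN for the two components.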