math500_cot: results by model



[Figure: SE predicted by accuracy — the typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.]
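One way to read that figure: if two models' per-question errors were independent binomial draws over the n = 500 MATH-500 questions, the standard error of their accuracy difference would be the root-sum-square of the two binomial SEs. This is a minimal sketch under that simplifying independence assumption (in practice errors on shared questions are correlated, which typically shrinks the paired SE); the function name is mine, not from the source.

```python
import math

def pairwise_se(acc_a, acc_b, n=500):
    """SE of the accuracy difference between two models, assuming
    independent binomial errors over n questions. This is an
    approximation: correlated errors on shared questions would
    reduce it."""
    var_a = acc_a * (1 - acc_a) / n
    var_b = acc_b * (1 - acc_b) / n
    return math.sqrt(var_a + var_b)

# Two models both near 50% accuracy on 500 questions:
print(f"{pairwise_se(0.5, 0.5) * 100:.2f} percentage points")
```

As the figure's x-axis suggests, this quantity peaks for mid-range accuracies and shrinks toward 0% and 100%, where the binomial variance A(1-A) vanishes.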

[Figure: CDF of question-level accuracy.]

Results table by model

(pass@1, win_rate, and the SE columns are in percentage points)

model pass@1 win_rate count SE(A) SE_x(A) SE_pred(A)
google_gemma_3_27b_it 85.5 44.9 3 1.6 1.2 0.99
qwen3-32b 81.5 41.6 2 1.7 1.2 1.3
deepseek_r1_distill_qwen_32b 81.1 41.8 2 1.8 1.2 1.3
qwen3-14b 80.7 41 3 1.8 1.4 1.1
deepseek_r1_distill_llama_70b 78.9 40 2 1.8 1.3 1.3
google_gemma_3_12b_it 78.3 39.4 3 1.8 1.5 1.1
qwen3-8b 78.3 39.2 3 1.8 1.5 1.1
qwen3-4b 77.1 38.3 4 1.9 1.5 1.2
deepseek_r1_distill_qwen_14b 76.3 38.5 3 1.9 1.4 1.3
qwen2.5-coder-32b-instruct 74.4 36.3 2 2 1.5 1.2
deepseek_r1_distill_qwen_7b 72.7 36 3 2 1.4 1.4
qwen2-math-7b-instruct 70.7 33.6 2 2 1.6 1.3
qwen2.5-coder-14b-instruct 67.2 31.2 3 2.1 1.6 1.4
deepseek_r1_distill_llama_8b 67.1 32.7 4 2.1 1.5 1.5
qwen3-1.7b 65.4 30.4 4 2.1 1.7 1.3
google_gemma_3_4b_it 65 31 4 2.1 1.7 1.3
deepseek_r1_distill_qwen_1.5b 64.6 30.3 4 2.1 1.5 1.5
llama-3.1-70B-instruct 63.4 29.4 4 2.2 2.1 0.2
qwen2-72b-instruct 63 28.7 2 2.2 1.7 1.4
qwen2-math-1.5b-instruct 61.5 27.7 2 2.2 1.6 1.4
qwen2.5-coder-7b-instruct 53.7 23.4 3 2.2 1.6 1.5
google_gemma_2_27b_it 52.8 22.1 1 2.2 NaN NaN
qwen2-7b-instruct 45.7 18.5 3 2.2 1.6 1.6
llama-3.1-8B-instruct 45.1 18.7 6 2.2 2.2 0.25
google_gemma_2_9b_it 44.9 17.8 3 2.2 1.9 1.2
mistralai_mixtral_8x22b_instruct_v0.1 43.8 17.9 2 2.2 1.5 1.7
qwen2.5-coder-3b-instruct 40.8 16.9 4 2.2 1.4 1.7
mistralai_mathstral_7b_v0.1 39.8 16.1 3 2.2 1.4 1.7
mistralai_ministral_8b_instruct_2410 39.8 15.8 3 2.2 1.5 1.6
llama-3.2-3B-instruct 38.3 15.1 10 2.2 2.2 0.27
qwen1.5-72b-chat 37.9 14.7 2 2.2 1.6 1.4
qwen1.5-32b-chat 37.1 14.6 2 2.2 1.5 1.6
qwen3-0.6b 31.6 12.3 3 2.1 1.4 1.6
qwen1.5-14b-chat 28.3 10.1 3 2 1.5 1.4
qwen2.5-coder-1.5b-instruct 26.2 9.64 4 2 1.2 1.5
mistralai_mixtral_8x7b_instruct_v0.1 23.3 8.16 3 1.9 1.2 1.4
llama-3.2-1B-instruct 20.5 7.33 11 1.8 1.8 0.16
deepseek_v2_lite_chat 18.9 6.46 2 1.8 1.2 1.3
google_codegemma_1.1_7b_it 18.6 6.2 4 1.7 1.2 1.3
qwen1.5-7b-chat 15.9 5.26 3 1.6 1.1 1.2
google_gemma_3_1b_it 13.6 5.92 4 1.5 0.91 1.2
mistralai_mistral_7b_instruct_v0.3 10.5 3.39 3 1.4 0.81 1.1
mistralai_mistral_7b_instruct_v0.2 9 3.06 3 1.3 0.71 1.1
qwen2.5-coder-0.5b-instruct 6 2.38 4 1.1 0.44 0.97
google_gemma_7b_it 5.75 1.86 4 1 0.64 0.82
qwen2-1.5b-instruct 4.65 1.54 4 0.94 0.42 0.84
mistralai_mistral_7b_instruct_v0.1 4.13 1.42 3 0.89 0.39 0.8
qwen2-0.5b-instruct 2.05 0.756 4 0.63 0.21 0.6
qwen1.5-1.8b-chat 0.867 0.293 3 0.41 0.16 0.38
google_gemma_2b_it 0.35 0.104 4 0.26 0 0.26
qwen1.5-0.5b-chat 0.3 0.145 4 0.24 0 0.24
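The SE(A) column above is consistent with the plain binomial standard error sqrt(A(1-A)/n) over the n = 500 MATH-500 questions — e.g. A = 85.5% gives about 1.6 points, A = 45.1% about 2.2, A = 0.35% about 0.26 (how SE_x(A) and SE_pred(A) decompose or predict this is not spelled out here, so they are not reproduced). A minimal check of that reading:

```python
import math

def binomial_se(acc, n=500):
    """Binomial standard error of an accuracy estimate over n
    independent questions: sqrt(A * (1 - A) / n)."""
    return math.sqrt(acc * (1 - acc) / n)

# Percentage-point SE for a few pass@1 values from the table:
for acc_pct in (85.5, 45.1, 0.35):
    se_pct = binomial_se(acc_pct / 100) * 100
    print(f"pass@1 {acc_pct:5.2f}%  ->  SE ~ {se_pct:.2f} points")
```

The agreement with the table's SE(A) values (to the reported precision) suggests that column treats each of the 500 questions as an independent Bernoulli trial.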