math_cot: by model



SE predicted by accuracy

[Figure] The typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
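If the predicted curve is the usual binomial standard error for an N-question benchmark, it takes a few lines to reproduce. A minimal sketch, assuming the prediction is sqrt(p(1 - p) / N); the function `predicted_se` and the 5,000-question example are illustrative assumptions, not the site's actual code, and the plotted pairwise SEs would additionally depend on how correlated two models' errors are:

```python
import numpy as np

def predicted_se(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy estimate, in the same
    units as `accuracy`: sqrt(p * (1 - p) / N).

    This is an assumed form for "SE predicted by accuracy"; the exact
    curve used for the plot may differ.
    """
    p = accuracy
    return np.sqrt(p * (1 - p) / n_questions)

# Example: at 70.9% accuracy on a hypothetical 5,000-question set,
# the predicted SE is about 0.64 percentage points.
print(100 * predicted_se(0.709, 5000))  # ~0.64
```

For what it's worth, the SE(A) values in the single-run rows of the table below are consistent with this formula at roughly N = 5,000 questions.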

CDF of question-level accuracy

[Figure]
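A sketch of how such a CDF can be computed, assuming the underlying data is a questions-by-runs matrix of 0/1 outcomes (the data format is an assumption, and `question_level_cdf` is a hypothetical helper, not the site's code):

```python
import numpy as np
import matplotlib.pyplot as plt

def question_level_cdf(scores: np.ndarray) -> None:
    """scores: (n_questions, n_runs) array of 0/1 outcomes.
    Plots the empirical CDF of per-question accuracy (mean over runs)."""
    q_acc = scores.mean(axis=1)            # accuracy of each question
    x = np.sort(q_acc)
    y = np.arange(1, len(x) + 1) / len(x)  # empirical CDF
    plt.step(x, y, where="post")
    plt.xlabel("question-level accuracy")
    plt.ylabel("fraction of questions")
    plt.show()
```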

Results table by model

pass@1 and win_rate are percentages; count appears to be the number of independent evaluation runs; the SE columns are in percentage points. SE(A) is the standard error of the accuracy estimate, split into a question-sampling component SE_x(A) and a prediction-sampling component SE_pred(A), so that SE(A)² ≈ SE_x(A)² + SE_pred(A)². Single-run rows (count = 1) report NaN because the split cannot be estimated from one run; a sketch of the decomposition follows the table.

| model | pass@1 | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---:|---:|---:|---:|---:|---:|
| google_gemma_3_27b_it | 86.5 | 50.3 | 2 | 0.48 | 0.4 | 0.27 |
| qwen3-14b | 82.4 | 47 | 2 | 0.54 | 0.43 | 0.33 |
| google_gemma_3_12b_it | 80.2 | 45.2 | 3 | 0.56 | 0.46 | 0.33 |
| qwen3-4b | 78.1 | 43.6 | 3 | 0.58 | 0.47 | 0.35 |
| qwen3-8b | 77.8 | 43.4 | 2 | 0.59 | 0.45 | 0.38 |
| qwen3-32b | 75.2 | 42.2 | 2 | 0.61 | 0.39 | 0.47 |
| google_gemma_3_4b_it | 71.5 | 38.7 | 4 | 0.64 | 0.53 | 0.36 |
| deepseek_r1_distill_llama_70b | 70.9 | 40.7 | 1 | 0.64 | NaN | NaN |
| deepseek_r1_distill_qwen_7b | 68.4 | 39.1 | 3 | 0.66 | 0.35 | 0.56 |
| deepseek_r1_distill_llama_8b | 65.6 | 37 | 3 | 0.67 | 0.37 | 0.56 |
| qwen3-1.7b | 63 | 32.8 | 3 | 0.68 | 0.52 | 0.44 |
| deepseek_r1_distill_qwen_14b | 59.9 | 34.1 | 3 | 0.69 | 0.38 | 0.58 |
| deepseek_r1_distill_qwen_1.5b | 59 | 32 | 3 | 0.7 | 0.38 | 0.58 |
| llama-3.1-70B-instruct | 58.5 | 29.8 | 3 | 0.7 | 0.69 | 0.077 |
| deepseek_r1_distill_qwen_32b | 57.8 | 32.6 | 1 | 0.7 | NaN | NaN |
| qwen2.5-coder-32b-instruct | 54.4 | 27.8 | 1 | 0.7 | NaN | NaN |
| qwen2-72b-instruct | 54.1 | 26.7 | 1 | 0.7 | NaN | NaN |
| google_gemma_2_9b_it | 44 | 20 | 3 | 0.7 | 0.57 | 0.41 |
| qwen1.5-72b-chat | 41.9 | 19.2 | 1 | 0.7 | NaN | NaN |
| qwen2.5-coder-14b-instruct | 41.4 | 19.4 | 2 | 0.7 | 0.45 | 0.53 |
| qwen1.5-32b-chat | 39.6 | 17.8 | 2 | 0.69 | 0.5 | 0.47 |
| qwen2.5-coder-7b-instruct | 35.3 | 16 | 3 | 0.68 | 0.41 | 0.54 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 33.1 | 14.4 | 2 | 0.67 | 0.45 | 0.49 |
| qwen2-7b-instruct | 32.8 | 14.8 | 3 | 0.66 | 0.42 | 0.51 |
| mistralai_mathstral_7b_v0.1 | 32.7 | 14.2 | 3 | 0.66 | 0.44 | 0.5 |
| qwen3-0.6b | 32.4 | 14 | 4 | 0.66 | 0.47 | 0.46 |
| google_gemma_3_1b_it | 32.1 | 13.9 | 3 | 0.66 | 0.51 | 0.42 |
| llama-3.2-3B-instruct | 30.9 | 13.2 | 10 | 0.65 | 0.65 | 0.055 |
| qwen2.5-coder-3b-instruct | 29.9 | 13.1 | 3 | 0.65 | 0.39 | 0.52 |
| llama-3.1-8B-instruct | 28.7 | 12.2 | 7 | 0.64 | 0.64 | 0.057 |
| qwen1.5-14b-chat | 28.5 | 11.9 | 2 | 0.64 | 0.45 | 0.46 |
| mistralai_ministral_8b_instruct_2410 | 28.5 | 12 | 3 | 0.64 | 0.41 | 0.49 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 22.7 | 9.09 | 2 | 0.59 | 0.39 | 0.45 |
| deepseek_v2_lite_chat | 22 | 8.79 | 2 | 0.59 | 0.38 | 0.45 |
| qwen2.5-coder-1.5b-instruct | 19.9 | 8.05 | 3 | 0.56 | 0.32 | 0.47 |
| google_codegemma_1.1_7b_it | 18.1 | 7.03 | 4 | 0.55 | 0.36 | 0.41 |
| qwen1.5-7b-chat | 17.6 | 6.8 | 3 | 0.54 | 0.32 | 0.43 |
| llama-3.2-1B-instruct | 12.4 | 4.6 | 12 | 0.47 | 0.46 | 0.042 |
| google_gemma_7b_it | 11.3 | 4.25 | 3 | 0.45 | 0.32 | 0.32 |
| mistralai_mistral_7b_instruct_v0.3 | 11.1 | 4.05 | 3 | 0.44 | 0.26 | 0.36 |
| mistralai_mistral_7b_instruct_v0.2 | 9.4 | 3.36 | 3 | 0.41 | 0.26 | 0.32 |
| qwen2-1.5b-instruct | 8.98 | 3.4 | 3 | 0.4 | 0.19 | 0.35 |
| google_gemma_2b_it | 6.05 | 2.29 | 3 | 0.34 | 0.21 | 0.26 |
| qwen2.5-coder-0.5b-instruct | 5.22 | 1.95 | 4 | 0.31 | 0.14 | 0.28 |
| mistralai_mistral_7b_instruct_v0.1 | 4.87 | 1.73 | 3 | 0.3 | 0.15 | 0.27 |
| qwen2-0.5b-instruct | 3.92 | 1.39 | 4 | 0.27 | 0.11 | 0.25 |
| qwen1.5-1.8b-chat | 3.29 | 1.15 | 3 | 0.25 | 0.11 | 0.23 |
| qwen1.5-0.5b-chat | 0.435 | 0.145 | 4 | 0.093 | 0.024 | 0.09 |
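The SE columns follow the pattern SE(A)² ≈ SE_x(A)² + SE_pred(A)², which is what a law-of-total-variance split gives: sampling questions contributes one term, sampling model predictions the other. A minimal sketch of such an estimator, assuming per-question, per-run 0/1 outcomes; `se_components` is illustrative, not the site's actual code, and it needs at least two runs per question, matching the NaN entries in the single-run rows:

```python
import numpy as np

def se_components(scores: np.ndarray):
    """Split the standard error of mean accuracy into question- and
    prediction-sampling parts via the law of total variance.

    scores: (n_questions, n_runs) array of 0/1 outcomes, one row per
    question, one column per independent run (requires n_runs >= 2).
    Returns (se_total, se_x, se_pred), with
    se_total**2 == se_x**2 + se_pred**2 by construction.
    """
    n_q, n_r = scores.shape
    q_means = scores.mean(axis=1)                    # per-question accuracy
    var_within = scores.var(axis=1, ddof=1).mean()   # prediction noise per question
    # var(q_means) mixes true question-to-question spread with leftover
    # prediction noise; subtract the latter for an unbiased split.
    var_between = max(q_means.var(ddof=1) - var_within / n_r, 0.0)
    se_x = np.sqrt(var_between / n_q)                # from sampling questions
    se_pred = np.sqrt(var_within / (n_q * n_r))      # from sampling predictions
    se_total = np.sqrt(se_x**2 + se_pred**2)
    return se_total, se_x, se_pred
```

Adding runs shrinks only the prediction term, which would explain why the heavily repeated llama rows (count 7 to 12) show SE_pred(A) an order of magnitude below the rest of the table.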