mgsm_cot: by models



Figure: SE predicted by accuracy

The typical standard error between pairs of models on this dataset, plotted as a function of absolute accuracy.
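For intuition, the standard error of an accuracy estimate is largest near 50% and shrinks toward the extremes. Below is a minimal sketch of the binomial prediction, assuming N = 2750 independent items (250 MGSM problems × 11 languages; the page does not state the item count, but this N reproduces the SE(A) column in the table below):

```python
import math

def predicted_se_pp(acc: float, n_items: int = 2750) -> float:
    """Binomial SE of accuracy in percentage points: 100 * sqrt(a(1-a)/N).

    n_items = 2750 is an assumption (250 MGSM problems x 11 languages);
    it reproduces the SE(A) column in the results table on this page.
    """
    return 100.0 * math.sqrt(acc * (1.0 - acc) / n_items)

print(round(predicted_se_pp(0.894), 2))  # 0.59, matches qwen3-32b
print(round(predicted_se_pp(0.511), 2))  # 0.95, matches llama-3.2-3B-instruct
```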

Figure: CDF of question-level accuracy
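Such a curve is the empirical CDF of per-question pass rates. A minimal sketch of the computation; the per-question data is not reproduced on this page, so the input array here is a hypothetical placeholder:

```python
import numpy as np

# q_acc[i] = pass rate of question i across the ~10-18 samples per question.
# Placeholder values; the real per-question pass rates are not shown here.
q_acc = np.random.default_rng(0).uniform(0.0, 1.0, size=2750)

xs = np.sort(q_acc)
cdf = np.arange(1, xs.size + 1) / xs.size  # P(question-level accuracy <= x)
# Plotting xs vs. cdf as a step function reproduces the figure above.
```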

Results table by model

Column meanings (the page does not define them, so glosses beyond the verifiable ones are presumptions): pass@1 is the mean single-sample accuracy in %, and SE(A) is its standard error in percentage points; win_rate is presumably a pairwise win rate (in %) against the other models; count is evidently the average number of samples drawn per question. Where reported, SE_x(A) and SE_pred(A) decompose the standard error into a question-sampling component and a per-question prediction-sampling component, with SE(A)² ≈ SE_x(A)² + SE_pred(A)². NaN marks models for which the decomposition is unavailable.

| model | pass@1 | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|:---|---:|---:|---:|---:|---:|---:|
| qwen3-32b | 89.4 | 40.6 | 11 | 0.59 | NaN | NaN |
| google_gemma_3_27b_it | 89.2 | 40.5 | 10 | 0.59 | 0.52 | 0.28 |
| deepseek_r1_distill_llama_70b | 86.7 | 39 | 11 | 0.65 | NaN | NaN |
| google_gemma_3_12b_it | 86.3 | 38.3 | 12 | 0.66 | NaN | NaN |
| llama-3.1-70B-instruct | 86.1 | 38.6 | 12 | 0.66 | 0.66 | 0 |
| qwen3-14b | 84.4 | 36.7 | 11 | 0.69 | 0.58 | 0.37 |
| google_gemma_2_27b_it | 83.9 | 36.8 | 10 | 0.7 | 0.58 | 0.39 |
| qwen2-72b-instruct | 80.6 | 34 | 11 | 0.75 | NaN | NaN |
| qwen3-8b | 80.5 | 33.9 | 11 | 0.76 | 0.64 | 0.41 |
| deepseek_r1_distill_qwen_32b | 80.3 | 33.9 | 11 | 0.76 | NaN | NaN |
| qwen2.5-coder-32b-instruct | 79.4 | 33.2 | 11 | 0.77 | NaN | NaN |
| qwen2-math-72b-instruct | 79 | 33.2 | 11 | 0.78 | NaN | NaN |
| google_gemma_2_9b_it | 79 | 33.4 | 11 | 0.78 | 0.64 | 0.44 |
| deepseek_r1_distill_qwen_14b | 74.6 | 30.1 | 13 | 0.83 | NaN | NaN |
| qwen3-4b | 74.4 | 29.8 | 12 | 0.83 | 0.72 | 0.42 |
| google_gemma_3_4b_it | 71.8 | 29.1 | 13 | 0.86 | 0.68 | 0.53 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 71.7 | 28.5 | 10 | 0.86 | NaN | NaN |
| qwen2.5-coder-14b-instruct | 68.8 | 26.7 | 11 | 0.88 | 0.69 | 0.56 |
| qwen1.5-72b-chat | 65.3 | 24.6 | 11 | 0.91 | NaN | NaN |
| mistralai_ministral_8b_instruct_2410 | 64.6 | 24.2 | 12 | 0.91 | NaN | NaN |
| llama-3.1-8B-instruct | 64.2 | 25.1 | 15 | 0.91 | 0.91 | 0 |
| deepseek_r1_distill_qwen_7b | 61.7 | 22.6 | 13 | 0.93 | NaN | NaN |
| mistralai_mathstral_7b_v0.1 | 60.1 | 22.6 | 13 | 0.93 | NaN | NaN |
| qwen1.5-32b-chat | 59.2 | 21.7 | 10 | 0.94 | NaN | NaN |
| qwen2.5-coder-7b-instruct | 57.3 | 20.1 | 13 | 0.94 | NaN | NaN |
| qwen2-7b-instruct | 57 | 20.3 | 13 | 0.94 | NaN | NaN |
| qwen3-1.7b | 54.6 | 18.9 | 12 | 0.95 | 0.78 | 0.54 |
| llama-3.2-3B-instruct | 51.1 | 18.3 | 18 | 0.95 | 0.95 | 0 |
| qwen2-math-7b-instruct | 50.8 | 17 | 12 | 0.95 | 0.74 | 0.6 |
| qwen1.5-14b-chat | 46.8 | 15.4 | 11 | 0.95 | 0.69 | 0.65 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 45.3 | 15.4 | 10 | 0.95 | 0.67 | 0.67 |
| deepseek_r1_distill_llama_8b | 44.8 | 14.6 | 12 | 0.95 | 0.7 | 0.64 |
| google_codegemma_1.1_7b_it | 41.5 | 13.9 | 13 | 0.94 | 0.68 | 0.65 |
| deepseek_v2_lite_chat | 40.2 | 13 | 9.9 | 0.94 | NaN | NaN |
| qwen2.5-coder-3b-instruct | 40.2 | 12.5 | 12 | 0.93 | 0.66 | 0.67 |
| deepseek_r1_distill_qwen_1.5b | 33.4 | 9.7 | 12 | 0.9 | 0.68 | 0.59 |
| qwen1.5-7b-chat | 31.9 | 9.27 | 11 | 0.89 | 0.62 | 0.63 |
| qwen2-math-1.5b-instruct | 31.2 | 8.91 | 12 | 0.88 | 0.71 | 0.53 |
| mistralai_mistral_7b_instruct_v0.3 | 29.1 | 8.63 | 13 | 0.87 | NaN | NaN |
| google_gemma_3_1b_it | 26.4 | 7.53 | 12 | 0.84 | 0.62 | 0.56 |
| mistralai_mistral_7b_instruct_v0.2 | 24.6 | 7.34 | 13 | 0.82 | NaN | NaN |
| qwen2.5-coder-1.5b-instruct | 23.1 | 6.16 | 12 | 0.8 | 0.53 | 0.6 |
| qwen3-0.6b | 22.8 | 6.32 | 13 | 0.8 | 0.59 | 0.54 |
| llama-3.2-1B-instruct | 19.5 | 5.83 | 8 | 0.76 | 0.76 | 0 |
| google_gemma_7b_it | 18.1 | 5.31 | 12 | 0.73 | 0.54 | 0.5 |
| mistralai_mistral_7b_instruct_v0.1 | 17.7 | 4.72 | 13 | 0.73 | NaN | NaN |
| qwen2-1.5b-instruct | 14.6 | 3.68 | 12 | 0.67 | 0.42 | 0.53 |
| google_gemma_2b_it | 5.5 | 1.79 | 12 | 0.43 | 0.25 | 0.35 |
| qwen1.5-1.8b-chat | 5.38 | 1.38 | 11 | 0.43 | 0.2 | 0.38 |
| qwen2-0.5b-instruct | 5.05 | 1.33 | 13 | 0.42 | 0.2 | 0.37 |
| qwen2.5-coder-0.5b-instruct | 4.89 | 1.31 | 13 | 0.41 | 0.21 | 0.35 |
| qwen1.5-0.5b-chat | 2.21 | 0.69 | 13 | 0.28 | 0.093 | 0.26 |
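Where both components are reported, they recover the total, SE(A)² ≈ SE_x(A)² + SE_pred(A)². A minimal check against a few rows copied from the table above:

```python
import math

# (model, SE(A), SE_x(A), SE_pred(A)) rows copied from the table above.
rows = [
    ("google_gemma_3_27b_it", 0.59, 0.52, 0.28),
    ("qwen3-14b",             0.69, 0.58, 0.37),
    ("llama-3.2-3B-instruct", 0.95, 0.95, 0.00),
    ("qwen1.5-0.5b-chat",     0.28, 0.093, 0.26),
]

for model, se, se_x, se_pred in rows:
    recovered = math.hypot(se_x, se_pred)  # sqrt(SE_x^2 + SE_pred^2)
    print(f"{model:24s} SE(A)={se:.2f} recovered={recovered:.2f}")
    assert abs(recovered - se) < 0.02, model  # matches to rounding error
```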