mgsm_cot: results by model



Figure: SE predicted by accuracy. The typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
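
For intuition, the per-model standard error can be approximated with the binomial formula; the SE(A) column in the table below is consistent with SE(A) = 100·sqrt(A(1−A)/n) at n = 2750 (MGSM's 250 problems across 11 languages). The sketch below is a minimal illustration under that assumption; the value of n and the independence assumption for the pairwise SE are inferences, not stated on this page.

```python
import math

N_ITEMS = 2750  # assumed: MGSM's 250 problems x 11 languages

def se_accuracy(acc_pct: float, n: int = N_ITEMS) -> float:
    """Binomial standard error of an accuracy estimate, in percentage points."""
    p = acc_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

def se_pairwise(acc_a: float, acc_b: float, n: int = N_ITEMS) -> float:
    """SE of the accuracy difference between two models, assuming
    independent errors (an upper bound; correlated models have lower SE)."""
    return math.sqrt(se_accuracy(acc_a, n) ** 2 + se_accuracy(acc_b, n) ** 2)

print(f"{se_accuracy(89.2):.2f}")        # ~0.59, matching SE(A) for gemma-3-27b
print(f"{se_pairwise(89.2, 89.0):.2f}")  # ~0.84 for the top two models
```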

Figure: CDF of question-level accuracy.
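
A CDF like the one in the figure can be rebuilt from raw per-question pass rates. The sketch below is hypothetical: the per_question_acc array is synthetic stand-in data, not data from this page.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical per-question pass rates for one model (fraction of
# samples that answered each question correctly); synthetic data only.
rng = np.random.default_rng(0)
per_question_acc = rng.beta(2.0, 1.0, size=2750)

# Empirical CDF: sort the values and plot against their normalized ranks.
xs = np.sort(per_question_acc)
ys = np.arange(1, len(xs) + 1) / len(xs)

plt.step(xs, ys, where="post")
plt.xlabel("question-level accuracy")
plt.ylabel("fraction of questions")
plt.title("CDF of question-level accuracy")
plt.show()
```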

Results table by model

| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| google_gemma_3_27b_it | 89.2 | 43.5 | 3 | 0.59 | 0.5 | 0.32 |
| qwen3-32b | 89 | 43.3 | 2.3 | 0.6 | NaN | NaN |
| google_gemma_3_12b_it | 86.2 | 41.2 | 3.4 | 0.66 | NaN | NaN |
| deepseek_r1_distill_llama_70b | 85.8 | 41.2 | 2.2 | 0.67 | NaN | NaN |
| qwen3-14b | 84.4 | 39.6 | 3 | 0.69 | 0.58 | 0.38 |
| google_gemma_2_27b_it | 82.5 | 38.8 | 1 | 0.73 | NaN | NaN |
| qwen3-8b | 80.1 | 36.5 | 3 | 0.76 | 0.62 | 0.44 |
| qwen2-72b-instruct | 79.4 | 35.9 | 2.2 | 0.77 | NaN | NaN |
| deepseek_r1_distill_qwen_32b | 79.4 | 36.1 | 2.2 | 0.77 | NaN | NaN |
| google_gemma_2_9b_it | 77.8 | 35.5 | 3 | 0.79 | 0.62 | 0.49 |
| qwen2.5-coder-32b-instruct | 77.3 | 34.6 | 2.2 | 0.8 | NaN | NaN |
| llama-3.1-70B-instruct | 77 | 35.8 | 4 | 0.8 | 0.8 | 0 |
| qwen3-4b | 73.7 | 32.1 | 4 | 0.84 | 0.71 | 0.44 |
| qwen2-math-72b-instruct | 73.7 | 32.7 | 2.2 | 0.84 | NaN | NaN |
| deepseek_r1_distill_qwen_14b | 73.7 | 32.3 | 3.4 | 0.84 | NaN | NaN |
| google_gemma_3_4b_it | 71.2 | 31.3 | 4 | 0.86 | 0.66 | 0.55 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 68 | 28.9 | 2.3 | 0.89 | NaN | NaN |
| qwen2.5-coder-14b-instruct | 64.9 | 26.9 | 3 | 0.91 | 0.67 | 0.61 |
| qwen1.5-72b-chat | 63.7 | 26 | 2.2 | 0.92 | NaN | NaN |
| qwen1.5-32b-chat | 58.6 | 23.5 | 2.3 | 0.94 | NaN | NaN |
| deepseek_r1_distill_qwen_7b | 58 | 22.9 | 3.4 | 0.94 | NaN | NaN |
| qwen3-1.7b | 54.9 | 21.1 | 4 | 0.95 | 0.77 | 0.56 |
| qwen2-7b-instruct | 52.6 | 19.9 | 3.4 | 0.95 | NaN | NaN |
| mistralai_ministral_8b_instruct_2410 | 51 | 19.2 | 3.4 | 0.95 | NaN | NaN |
| mistralai_mathstral_7b_v0.1 | 50.9 | 20.1 | 3.4 | 0.95 | NaN | NaN |
| qwen2.5-coder-7b-instruct | 50.8 | 18.9 | 3.4 | 0.95 | NaN | NaN |
| mistralai_mixtral_8x7b_instruct_v0.1 | 44.4 | 16.7 | 3 | 0.95 | 0.64 | 0.7 |
| llama-3.1-8B-instruct | 44 | 16.8 | 7 | 0.95 | 0.95 | 0 |
| qwen1.5-14b-chat | 43.5 | 15.6 | 3 | 0.95 | 0.67 | 0.66 |
| qwen2-math-7b-instruct | 43.3 | 15.4 | 4 | 0.94 | 0.68 | 0.66 |
| deepseek_r1_distill_llama_8b | 40.4 | 14.2 | 4 | 0.94 | 0.66 | 0.67 |
| google_codegemma_1.1_7b_it | 38.4 | 14 | 4 | 0.93 | 0.63 | 0.68 |
| deepseek_v2_lite_chat | 37.2 | 13 | 2.3 | 0.92 | NaN | NaN |
| llama-3.2-3B-instruct | 35.2 | 12.6 | 10 | 0.91 | 0.91 | 0 |
| qwen2.5-coder-3b-instruct | 34.3 | 11.5 | 4 | 0.91 | 0.61 | 0.67 |
| qwen1.5-7b-chat | 29.4 | 9.3 | 3 | 0.87 | 0.59 | 0.63 |
| deepseek_r1_distill_qwen_1.5b | 27.2 | 8.45 | 4 | 0.85 | 0.61 | 0.59 |
| google_gemma_3_1b_it | 26 | 8.38 | 4 | 0.84 | 0.6 | 0.59 |
| mistralai_mistral_7b_instruct_v0.3 | 24.5 | 7.91 | 3.4 | 0.82 | NaN | NaN |
| mistralai_mistral_7b_instruct_v0.2 | 23.9 | 7.99 | 3.4 | 0.81 | NaN | NaN |
| qwen3-0.6b | 21.9 | 6.84 | 4 | 0.79 | 0.55 | 0.57 |
| qwen2-math-1.5b-instruct | 20 | 5.99 | 4 | 0.76 | 0.54 | 0.54 |
| google_gemma_7b_it | 17.8 | 5.79 | 4 | 0.73 | 0.51 | 0.52 |
| qwen2.5-coder-1.5b-instruct | 17 | 4.89 | 4 | 0.72 | 0.43 | 0.57 |
| mistralai_mistral_7b_instruct_v0.1 | 12.1 | 3.48 | 3.4 | 0.62 | NaN | NaN |
| qwen2-1.5b-instruct | 8.15 | 2.3 | 4 | 0.52 | 0.26 | 0.46 |
| google_gemma_2b_it | 5.05 | 1.89 | 4 | 0.42 | 0.21 | 0.36 |
| qwen2.5-coder-0.5b-instruct | 3.46 | 1.08 | 4 | 0.35 | 0.14 | 0.32 |
| qwen1.5-1.8b-chat | 3.37 | 1.09 | 3 | 0.34 | 0.13 | 0.32 |
| qwen2-0.5b-instruct | 2.97 | 0.949 | 4 | 0.32 | 0.12 | 0.3 |
| qwen1.5-0.5b-chat | 1.42 | 0.6 | 4 | 0.23 | 0.055 | 0.22 |
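
The three SE columns are consistent with a variance decomposition in which the total standard error splits into a question-sampling term and a per-question prediction term: SE(A)² ≈ SE_x(A)² + SE_pred(A)². The check below is a minimal sketch over a few rows copied from the table; the decomposition itself is inferred from the numbers, not stated on this page.

```python
import math

# (model, SE(A), SE_x(A), SE_pred(A)) copied from rows of the table above.
rows = [
    ("google_gemma_3_27b_it", 0.59, 0.50, 0.32),
    ("qwen3-14b", 0.69, 0.58, 0.38),
    ("mistralai_mixtral_8x7b_instruct_v0.1", 0.95, 0.64, 0.70),
    ("qwen1.5-0.5b-chat", 0.23, 0.055, 0.22),
]

for model, se_total, se_x, se_pred in rows:
    # The two components should recombine in quadrature to the total SE.
    recombined = math.hypot(se_x, se_pred)
    print(f"{model:40s} SE(A)={se_total:.2f}  sqrt(SE_x^2+SE_pred^2)={recombined:.2f}")
```

Each printed pair agrees to two decimal places, which is why rows lacking the component estimates (the NaN entries) still report a total SE(A).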