gsm8k_cot: by model



SE predicted by accuracy

[Figure: typical standard errors between pairs of models on this dataset, plotted as a function of absolute accuracy.]
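For a dataset of fixed size, the standard error of an accuracy estimate is tied to the accuracy itself. The SE(A) values tabulated below match the binomial standard error sqrt(A(1-A)/n) at n = 1319 questions; the n value is an assumption based on the size of the standard GSM8K test split. A minimal sketch:

```python
import math

def predicted_se(accuracy_pct: float, n_questions: int = 1319) -> float:
    """Binomial standard error of an accuracy estimate, in percentage points.

    n_questions=1319 assumes the standard GSM8K test split.
    """
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

# Reproduces the SE(A) column below, e.g. for google_gemma_3_27b_it:
print(round(predicted_se(94.4), 2))  # 0.63
```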

CDF of question-level accuracy

[Figure: empirical CDF of question-level accuracy.]
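A curve of this kind summarizes how accuracy is distributed across individual questions: for each question, the fraction of sampled generations that answer it correctly. A minimal sketch of the computation, where the score matrix and its shape are illustrative assumptions:

```python
import numpy as np

# Hypothetical score matrix: scores[i, j] = 1 if generation j on question i
# was correct, else 0. The shape (1319 questions x 4 samples) is illustrative.
rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=(1319, 4))

# Per-question accuracy: mean correctness over the sampled generations.
q_acc = scores.mean(axis=1)

# Empirical CDF: fraction of questions at or below each accuracy level.
for level in np.unique(q_acc):
    print(f"P(question accuracy <= {level:.2f}) = {(q_acc <= level).mean():.3f}")
```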

Results table by model

pass@1 and pass@count are percentages; the SE columns are in percentage points.

model pass@1 pass@count win_rate count SE(A) SE_x(A) SE_pred(A)
google_gemma_3_27b_it 94.4 95.9 32.6 3 0.63 0.56 0.29
qwen3-14b 93.5 95.6 32.2 3 0.68 0.59 0.34
qwen3-32b 93.4 95.8 32.2 2 0.69 0.53 0.43
llama-3.1-70B-instruct 92.9 92.9 31.9 4 0.71 0.71 0
google_gemma_3_12b_it 92.3 95.6 31.4 4 0.73 0.61 0.4
qwen3-8b 92 94.8 31 3 0.75 0.63 0.4
qwen2.5-coder-32b-instruct 91.3 95.1 30.8 2 0.78 0.56 0.54
deepseek_r1_distill_llama_70b 90.6 94.8 30.9 2 0.8 0.57 0.56
google_gemma_2_27b_it 89.8 92.8 29.6 2 0.83 0.68 0.48
qwen2-72b-instruct 89.5 94.8 30.1 2 0.84 0.56 0.63
qwen3-4b 89.5 92.9 29.6 4 0.85 0.76 0.38
google_gemma_2_9b_it 86.8 92.8 28 3 0.93 0.74 0.57
qwen2-math-72b-instruct 84.4 93.9 27.7 2 1 0.52 0.85
google_gemma_3_4b_it 83.3 91.1 26.3 5 1 0.86 0.57
qwen1.5-72b-chat 82.7 90.1 26.4 2 1 0.72 0.75
deepseek_r1_distill_qwen_7b 81.9 94.5 26.3 4 1.1 0.67 0.82
qwen1.5-32b-chat 81.7 90.1 26 2 1.1 0.7 0.8
deepseek_r1_distill_qwen_14b 81.3 93.3 26.4 4 1.1 0.72 0.8
qwen2.5-coder-14b-instruct 81 95.6 26.1 3 1.1 0.58 0.91
deepseek_r1_distill_qwen_32b 80.3 91.3 26.8 2 1.1 0.61 0.91
qwen2-math-7b-instruct 79.8 93.3 24.8 4 1.1 0.74 0.82
llama-3.1-8B-instruct 78.3 78.3 23.8 7 1.1 1.1 0
mistralai_mixtral_8x22b_instruct_v0.1 78.3 89.3 24.4 2 1.1 0.67 0.91
qwen2-math-1.5b-instruct 75.2 90.6 22.7 4 1.2 0.81 0.87
mistralai_ministral_8b_instruct_2410 73.9 92.3 22.3 4 1.2 0.74 0.96
qwen3-1.7b 73.4 84.8 21.6 4 1.2 0.99 0.71
deepseek_r1_distill_llama_8b 73 92.9 22.5 4 1.2 0.71 1
qwen2-7b-instruct 71.7 91.9 21.6 4 1.2 0.73 1
qwen1.5-14b-chat 70.4 85.5 20.3 3 1.3 0.87 0.9
llama-3.2-3B-instruct 67.6 67.6 19.4 10 1.3 1.3 0
qwen2.5-coder-7b-instruct 65.4 89.5 19.2 4 1.3 0.74 1.1
mistralai_mathstral_7b_v0.1 63.4 89.8 18.2 4 1.3 0.73 1.1
deepseek_v2_lite_chat 62.1 75.2 17.1 2 1.3 0.89 1
mistralai_mixtral_8x7b_instruct_v0.1 60.8 81.9 17.1 3 1.3 0.84 1
deepseek_r1_distill_qwen_1.5b 59.8 86.6 17.3 4 1.4 0.72 1.1
qwen1.5-7b-chat 55.2 75.4 14.5 3 1.4 0.91 1
qwen2.5-coder-3b-instruct 54.9 82.1 14.6 4 1.4 0.81 1.1
google_codegemma_1.1_7b_it 50.3 79 13 5 1.4 0.89 1.1
google_gemma_3_1b_it 44.6 62.9 10.9 4 1.4 1.1 0.86
mistralai_mistral_7b_instruct_v0.3 43.4 72.2 11 4 1.4 0.82 1.1
qwen3-0.6b 38.2 64.8 8.87 5 1.3 0.92 0.98
qwen2.5-coder-1.5b-instruct 37.4 64.8 8.72 4 1.3 0.82 1
mistralai_mistral_7b_instruct_v0.2 36.8 62.4 8.82 4 1.3 0.86 1
llama-3.2-1B-instruct 29 29 6.29 13 1.2 1.2 0
google_gemma_7b_it 27.5 47.5 6.17 4 1.2 0.85 0.89
mistralai_mistral_7b_instruct_v0.1 22.5 47.8 4.79 4 1.1 0.63 0.96
qwen2-1.5b-instruct 20.5 47.8 4.43 4 1.1 0.53 0.98
qwen1.5-1.8b-chat 11.1 24.5 2.12 3 0.86 0.41 0.76
qwen2.5-coder-0.5b-instruct 9.83 29.8 1.87 5 0.82 0.38 0.73
google_gemma_2b_it 9.17 21.1 1.77 4 0.79 0.49 0.62
qwen2-0.5b-instruct 8.49 27.1 1.58 5 0.77 0.33 0.69
qwen1.5-0.5b-chat 3.12 12.7 0.622 5 0.48 0.14 0.46
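The three SE columns are consistent with a decomposition in quadrature, SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2, splitting the total standard error into a question-sampling term (SE_x) and a resampling term (SE_pred); reading the columns this way is an inference from the tabulated numbers, not something stated on this page. A quick check against the top row:

```python
import math

# google_gemma_3_27b_it row: SE(A)=0.63, SE_x(A)=0.56, SE_pred(A)=0.29
# (all in percentage points). hypot combines the components in quadrature.
se_x, se_pred = 0.56, 0.29
print(math.hypot(se_x, se_pred))  # ~0.631, matching SE(A)=0.63 up to rounding
```

Consistent with this reading, the rows with SE_pred(A) = 0 (the llama-3.x models) also have pass@1 equal to pass@count, as expected when every sampled generation is identical (e.g. greedy decoding).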