gsm8k_cot: by model

Figure: SE predicted by accuracy. The typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
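A natural baseline for "SE predicted by accuracy" is the binomial standard error of a single model's accuracy, SE = sqrt(A(1 − A)/N). A minimal sketch, assuming the standard GSM8K test split of N = 1319 questions; the SE(A) column in the table below is consistent with this formula:

```python
import math

# Predicted standard error of accuracy under a binomial model:
# each of the N questions is an independent pass/fail trial.
# Assumption: N = 1319, the size of the GSM8K test split.
N_QUESTIONS = 1319

def predicted_se(accuracy: float, n: int = N_QUESTIONS) -> float:
    """SE of mean accuracy predicted from the accuracy alone."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)

# Example: 94.4% pass@1 -> predicted SE of ~0.63 percentage points,
# matching the SE(A) column for the top row of the table below.
print(f"{100 * predicted_se(0.944):.2f}")
```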

Figure: CDF of question-level accuracy.
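The per-question scores behind this figure are not reproduced on this page, but an empirical CDF like it is simple to compute. A minimal sketch with NumPy; `question_acc` is placeholder data, not the real per-question accuracies:

```python
import numpy as np

# Hypothetical per-question accuracies for one model (the fraction of
# sampled generations that answer each question correctly).
rng = np.random.default_rng(0)
question_acc = rng.beta(8.0, 2.0, size=1319)  # placeholder data

# Empirical CDF: at the i-th smallest value, the CDF equals i / n.
xs = np.sort(question_acc)
cdf = np.arange(1, xs.size + 1) / xs.size
# Plotting cdf against xs gives a curve like the figure above.
```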

Results table by model

pass@1 and win_rate are given in percent; the SE columns are in percentage points. SE(A) is the total standard error of pass@1, and throughout the table SE(A)² ≈ SE_x(A)² + SE_pred(A)²: the total error decomposes in quadrature into what the naming suggests are a question-level component, SE_x(A), and a prediction-level (resampling) component, SE_pred(A).

| model | pass@1 | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---:|---:|---:|---:|---:|---:|
| google_gemma_3_27b_it | 94.4 | 32.6 | 3 | 0.63 | 0.56 | 0.29 |
| qwen3-14b | 93.5 | 32.2 | 3 | 0.68 | 0.59 | 0.34 |
| qwen3-32b | 93.4 | 32.2 | 2 | 0.69 | 0.53 | 0.43 |
| llama-3.1-70B-instruct | 92.9 | 31.9 | 4 | 0.71 | 0.71 | 0 |
| google_gemma_3_12b_it | 92.3 | 31.4 | 4 | 0.73 | 0.61 | 0.4 |
| qwen3-8b | 92 | 31 | 3 | 0.75 | 0.63 | 0.4 |
| qwen2.5-coder-32b-instruct | 91.3 | 30.8 | 2 | 0.78 | 0.56 | 0.54 |
| deepseek_r1_distill_llama_70b | 90.6 | 30.9 | 2 | 0.8 | 0.57 | 0.56 |
| google_gemma_2_27b_it | 89.8 | 29.6 | 2 | 0.83 | 0.68 | 0.48 |
| qwen2-72b-instruct | 89.5 | 30.1 | 2 | 0.84 | 0.56 | 0.63 |
| qwen3-4b | 89.5 | 29.6 | 4 | 0.85 | 0.76 | 0.38 |
| google_gemma_2_9b_it | 86.8 | 28 | 3 | 0.93 | 0.74 | 0.57 |
| qwen2-math-72b-instruct | 84.4 | 27.7 | 2 | 1 | 0.52 | 0.85 |
| google_gemma_3_4b_it | 83.3 | 26.3 | 5 | 1 | 0.86 | 0.57 |
| qwen1.5-72b-chat | 82.7 | 26.4 | 2 | 1 | 0.72 | 0.75 |
| deepseek_r1_distill_qwen_7b | 81.9 | 26.3 | 4 | 1.1 | 0.67 | 0.82 |
| qwen1.5-32b-chat | 81.7 | 26 | 2 | 1.1 | 0.7 | 0.8 |
| deepseek_r1_distill_qwen_14b | 81.3 | 26.4 | 4 | 1.1 | 0.72 | 0.8 |
| qwen2.5-coder-14b-instruct | 81 | 26.1 | 3 | 1.1 | 0.58 | 0.91 |
| deepseek_r1_distill_qwen_32b | 80.3 | 26.8 | 2 | 1.1 | 0.61 | 0.91 |
| qwen2-math-7b-instruct | 79.8 | 24.8 | 4 | 1.1 | 0.74 | 0.82 |
| llama-3.1-8B-instruct | 78.3 | 23.8 | 7 | 1.1 | 1.1 | 0 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 78.3 | 24.4 | 2 | 1.1 | 0.67 | 0.91 |
| qwen2-math-1.5b-instruct | 75.2 | 22.7 | 4 | 1.2 | 0.81 | 0.87 |
| mistralai_ministral_8b_instruct_2410 | 73.9 | 22.3 | 4 | 1.2 | 0.74 | 0.96 |
| qwen3-1.7b | 73.4 | 21.6 | 4 | 1.2 | 0.99 | 0.71 |
| deepseek_r1_distill_llama_8b | 73 | 22.5 | 4 | 1.2 | 0.71 | 1 |
| qwen2-7b-instruct | 71.7 | 21.6 | 4 | 1.2 | 0.73 | 1 |
| qwen1.5-14b-chat | 70.4 | 20.3 | 3 | 1.3 | 0.87 | 0.9 |
| llama-3.2-3B-instruct | 67.6 | 19.4 | 10 | 1.3 | 1.3 | 0 |
| qwen2.5-coder-7b-instruct | 65.4 | 19.2 | 4 | 1.3 | 0.74 | 1.1 |
| mistralai_mathstral_7b_v0.1 | 63.4 | 18.2 | 4 | 1.3 | 0.73 | 1.1 |
| deepseek_v2_lite_chat | 62.1 | 17.1 | 2 | 1.3 | 0.89 | 1 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 60.8 | 17.1 | 3 | 1.3 | 0.84 | 1 |
| deepseek_r1_distill_qwen_1.5b | 59.8 | 17.3 | 4 | 1.4 | 0.72 | 1.1 |
| qwen1.5-7b-chat | 55.2 | 14.5 | 3 | 1.4 | 0.91 | 1 |
| qwen2.5-coder-3b-instruct | 54.9 | 14.6 | 4 | 1.4 | 0.81 | 1.1 |
| google_codegemma_1.1_7b_it | 50.3 | 13 | 5 | 1.4 | 0.89 | 1.1 |
| google_gemma_3_1b_it | 44.6 | 10.9 | 4 | 1.4 | 1.1 | 0.86 |
| mistralai_mistral_7b_instruct_v0.3 | 43.4 | 11 | 4 | 1.4 | 0.82 | 1.1 |
| qwen3-0.6b | 38.2 | 8.87 | 5 | 1.3 | 0.92 | 0.98 |
| qwen2.5-coder-1.5b-instruct | 37.4 | 8.72 | 4 | 1.3 | 0.82 | 1 |
| mistralai_mistral_7b_instruct_v0.2 | 36.8 | 8.82 | 4 | 1.3 | 0.86 | 1 |
| llama-3.2-1B-instruct | 29 | 6.29 | 13 | 1.2 | 1.2 | 0 |
| google_gemma_7b_it | 27.5 | 6.17 | 4 | 1.2 | 0.85 | 0.89 |
| mistralai_mistral_7b_instruct_v0.1 | 22.5 | 4.79 | 4 | 1.1 | 0.63 | 0.96 |
| qwen2-1.5b-instruct | 20.5 | 4.43 | 4 | 1.1 | 0.53 | 0.98 |
| qwen1.5-1.8b-chat | 11.1 | 2.12 | 3 | 0.86 | 0.41 | 0.76 |
| qwen2.5-coder-0.5b-instruct | 9.83 | 1.87 | 5 | 0.82 | 0.38 | 0.73 |
| google_gemma_2b_it | 9.17 | 1.77 | 4 | 0.79 | 0.49 | 0.62 |
| qwen2-0.5b-instruct | 8.49 | 1.58 | 5 | 0.77 | 0.33 | 0.69 |
| qwen1.5-0.5b-chat | 3.12 | 0.622 | 5 | 0.48 | 0.14 | 0.46 |
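As a quick consistency check on the quadrature decomposition stated above, the two SE components recombine to the total SE. A short sketch using a few triples copied from the table:

```python
import math

# (SE(A), SE_x(A), SE_pred(A)) triples copied from the table above.
rows = {
    "google_gemma_3_27b_it": (0.63, 0.56, 0.29),
    "qwen2-math-72b-instruct": (1.0, 0.52, 0.85),
    "qwen1.5-0.5b-chat": (0.48, 0.14, 0.46),
}

for model, (se, se_x, se_pred) in rows.items():
    # The components should recombine in quadrature to the total SE.
    recombined = math.hypot(se_x, se_pred)
    print(f"{model}: SE(A)={se:.2f}  sqrt(SE_x^2 + SE_pred^2)={recombined:.2f}")
```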