gsm8k_plus_cot: by model



[Figure: SE predicted by accuracy. Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.]
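
For reference, the standard error of a single model's accuracy A measured on n questions is the binomial sqrt(A(1-A)/n): it peaks at A = 0.5 and shrinks toward both extremes, which is the shape the SE(A) column in the table below follows. A minimal sketch (the question count n here is a placeholder, not a value taken from this page):

    import math

    def se_accuracy(acc: float, n: int) -> float:
        # Binomial standard error of an accuracy estimate (acc as a fraction of 1).
        return math.sqrt(acc * (1.0 - acc) / n)

    n = 10_000  # hypothetical question count; substitute the real dataset size
    for acc in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(f"A = {acc:.0%}: SE = {100 * se_accuracy(acc, n):.2f} pp")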

[Figure: CDF of question-level accuracy.]
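
This curve can be recomputed from per-question results: score each question by the fraction of its sampled responses that are correct, then sort. A sketch, with hypothetical data standing in for the real per-question success counts:

    import numpy as np

    # Hypothetical per-question success counts out of k samples each.
    successes = np.array([3, 0, 2, 3, 1, 3, 3, 0, 2, 3])
    k = 3

    q_acc = np.sort(successes / k)                   # question-level accuracies, ascending
    cdf = np.arange(1, len(q_acc) + 1) / len(q_acc)  # empirical CDF
    for x, p in zip(q_acc, cdf):
        print(f"P(question accuracy <= {x:.2f}) = {p:.2f}")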

Results table by model

All pass rates and win rates are percentages, and the SE columns are in percentage points. count is the number of sampled responses per question; pass@count is the pass rate when all count samples per question are pooled (pass@k with k = count). SE_x(A) and SE_pred(A) are reported as NaN when count = 1.

model                                  pass@1  pass@count  win_rate  count  SE(A)  SE_x(A)  SE_pred(A)
qwen3-32b                                81.4        85.3      34.3      2   0.38     0.33        0.19
qwen3-14b                                81.1        85.2        34      3   0.38     0.34        0.17
qwen3-8b                                 78.7        84.6      32.4      3    0.4     0.35         0.2
llama-3.1-70B-instruct                   77.3        77.3        32      4   0.41     0.41           0
deepseek_r1_distill_llama_70b            76.2        76.2      31.4      1   0.41     NaN         NaN
google_gemma_3_27b_it                    75.1        79.1      30.1      3   0.42     0.39        0.16
qwen2-72b-instruct                       74.4        81.9      29.9      2   0.42     0.33        0.27
google_gemma_3_12b_it                    74.2        79.9      29.5      3   0.43     0.38        0.19
google_gemma_2_27b_it                      74          74      29.5      1   0.43     NaN         NaN
qwen3-4b                                 73.7        79.6      29.2      3   0.43     0.38         0.2
qwen2.5-coder-32b-instruct               73.5          79      29.2      2   0.43     0.36        0.23
deepseek_r1_distill_qwen_14b               71        85.4      28.9      3   0.44     0.31        0.32
google_gemma_2_9b_it                     70.1        80.1      27.2      3   0.45     0.37        0.25
deepseek_r1_distill_qwen_32b             70.1        70.1      29.1      1   0.45     NaN         NaN
deepseek_r1_distill_qwen_7b              68.6        83.4      27.2      3   0.45     0.32        0.32
qwen2.5-coder-14b-instruct               66.8        81.9      25.5      3   0.46     0.33        0.32
qwen2-math-72b-instruct                  66.3        77.9      25.7      2   0.46     0.32        0.33
qwen1.5-72b-chat                         65.7        75.5        25      2   0.46     0.35         0.3
qwen1.5-32b-chat                         64.8        75.9      24.6      2   0.46     0.33        0.32
google_gemma_3_4b_it                     64.1        74.6      23.4      4   0.47      0.4        0.24
mistralai_mixtral_8x22b_instruct_v0.1    63.2        76.3      24.1      2   0.47     0.31        0.35
qwen2-math-7b-instruct                     62        77.8      22.9      3   0.47     0.35        0.32
deepseek_r1_distill_llama_8b               62        83.8      24.3      4   0.47      0.3        0.36
llama-3.1-8B-instruct                    60.6        60.6      22.7      7   0.48     0.48           0
mistralai_ministral_8b_instruct_2410     59.7        77.6      21.8      3   0.48     0.33        0.34
qwen2-7b-instruct                        59.2        77.6      21.6      3   0.48     0.33        0.35
qwen3-1.7b                               58.6        69.4        21      3   0.48      0.4        0.26
qwen2-math-1.5b-instruct                 56.8        73.3      20.2      3   0.48     0.36        0.33
qwen1.5-14b-chat                         53.6        71.3      18.7      3   0.49     0.35        0.34
qwen2.5-coder-7b-instruct                51.9        73.8      18.4      3   0.49     0.31        0.38
llama-3.2-3B-instruct                    49.6        49.6      17.2     10   0.49     0.49           0
deepseek_r1_distill_qwen_1.5b            49.6        76.8      18.8      4   0.49     0.29        0.39
mistralai_mathstral_7b_v0.1              47.5        70.6      16.4      3   0.49      0.3        0.38
deepseek_v2_lite_chat                    46.7        59.5      15.6      2   0.49     0.34        0.35
mistralai_mixtral_8x7b_instruct_v0.1     44.8        66.7      15.3      3   0.48     0.31        0.37
qwen2.5-coder-3b-instruct                40.3        62.7      13.1      3   0.48      0.3        0.37
qwen1.5-7b-chat                          40.2        61.1      13.1      3   0.48     0.32        0.36
google_codegemma_1.1_7b_it               33.5        56.6        10      4   0.46     0.31        0.34
mistralai_mistral_7b_instruct_v0.3       32.4        54.1      10.4      3   0.46     0.28        0.36
google_gemma_3_1b_it                     28.9        45.6      8.36      4   0.44     0.34        0.28
qwen3-0.6b                                 28        49.1       8.4      4   0.44      0.3        0.32
mistralai_mistral_7b_instruct_v0.2       27.9          47      8.92      3   0.44     0.28        0.34
qwen2.5-coder-1.5b-instruct              26.5        47.3      8.12      3   0.43     0.25        0.35
llama-3.2-1B-instruct                      19          19      5.62     12   0.38     0.38           0
google_gemma_7b_it                       18.5        35.8       5.3      4   0.38     0.26        0.28
mistralai_mistral_7b_instruct_v0.1         16        32.9      5.07      3   0.36     0.18        0.31
qwen2-1.5b-instruct                      15.2        32.5      5.14      3   0.35     0.16        0.31
qwen1.5-1.8b-chat                        11.6        24.5      4.81      3   0.31     0.16        0.27
qwen2.5-coder-0.5b-instruct               7.1        20.8      2.39      4   0.25     0.11        0.23
qwen2-0.5b-instruct                      6.61        20.3      2.56      4   0.24    0.087        0.23
google_gemma_2b_it                        6.2        15.3      1.81      4   0.23     0.14        0.19
qwen1.5-0.5b-chat                        5.41        14.5       3.2      4   0.22     0.11        0.19
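
The last three columns satisfy SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2 row by row (e.g. for qwen3-32b, 0.33^2 + 0.19^2 ≈ 0.38^2): the total standard error splits into a question-sampling term and a resampling term, where the latter is zero whenever repeated samples always agree (the llama rows) and undefined at count = 1. One way such a decomposition can be computed from a questions-by-samples correctness matrix is sketched below; this is an illustration under those assumptions, not necessarily the exact estimator behind this table.

    import numpy as np

    def se_decomposition(scores: np.ndarray):
        # scores: (n_questions, k) 0/1 matrix of per-sample correctness.
        # Returns (se_total, se_x, se_pred) as fractions of 1.
        n, k = scores.shape
        p = scores.mean(axis=1)                 # per-question accuracy
        se_total = p.std(ddof=1) / np.sqrt(n)   # SE of the mean accuracy across questions
        if k > 1:
            within = p * (1 - p) * k / (k - 1)  # unbiased within-question Bernoulli variance
            se_pred = float(np.sqrt(within.mean() / (n * k)))           # resampling component
            se_x = float(np.sqrt(max(se_total**2 - se_pred**2, 0.0)))   # question-sampling component
        else:
            se_x = se_pred = float("nan")       # cannot be estimated from a single sample
        return float(se_total), se_x, se_pred

    # Toy example: 1,000 questions, 3 samples each, ~70% accuracy.
    # With i.i.d. samples and no per-question difficulty effect, se_x comes out near 0.
    rng = np.random.default_rng(0)
    scores = (rng.random((1000, 3)) < 0.7).astype(int)
    print(tuple(round(100 * v, 2) for v in se_decomposition(scores)))  # percentage points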