gsm8k_plus_cot: by models



Figure: SE predicted by accuracy. Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.
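The page does not spell out how SE(A) is computed, but the SE(A) column in the table below is consistent with the usual binomial standard error sqrt(A(1-A)/N) for N ≈ 10,500 questions; note that GSM-Plus ships 1,319 GSM8K problems with 8 perturbations each (10,552 items), though that N is an assumption here, not something stated on this page. A minimal sketch under those assumptions, including the standard error of a difference between two models:

```python
import math

def se_accuracy(acc_pct: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy, in percentage points."""
    p = acc_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

def se_difference(se_a: float, se_b: float, corr: float = 0.0) -> float:
    """SE of the accuracy difference between two models.

    corr is the per-question score correlation; evaluating both models
    on the same questions typically makes corr > 0, which shrinks the
    SE of the paired difference.
    """
    return math.sqrt(se_a**2 + se_b**2 - 2.0 * corr * se_a * se_b)

# N = 10,552 is an assumption (1,319 GSM8K problems x 8 GSM-Plus
# perturbations); it reproduces the SE(A) column to rounding.
print(se_accuracy(81.4, 10552))   # ~0.379, cf. qwen3-32b SE(A) = 0.38
print(se_accuracy(49.6, 10552))   # ~0.487, cf. llama-3.2-3B SE(A) = 0.49
```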

Figure: CDF of question-level accuracy.
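The CDF plot itself is not recoverable from this extract. For reference, a question-level accuracy CDF can be built from a correctness matrix (questions × repeated samples per question); the sketch below uses a purely illustrative random matrix, not the actual evaluation data.

```python
import numpy as np

# Illustrative correctness matrix: rows are questions, columns are
# repeated samples from one model; entries are 1 if scored correct.
rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=(10552, 4))

q_acc = correct.mean(axis=1)               # per-question accuracy
xs = np.sort(q_acc)                        # sorted accuracies
cdf = np.arange(1, len(xs) + 1) / len(xs)  # empirical CDF heights

# Plotting (xs, cdf) gives the fraction of questions whose accuracy
# falls at or below each level.
```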

Results table by model

| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| qwen3-32b | 81.4 | 34.3 | 2 | 0.38 | 0.33 | 0.19 |
| qwen3-14b | 81.1 | 34 | 3 | 0.38 | 0.34 | 0.17 |
| qwen3-8b | 78.7 | 32.4 | 3 | 0.4 | 0.35 | 0.2 |
| llama-3.1-70B-instruct | 77.3 | 32 | 4 | 0.41 | 0.41 | 0 |
| deepseek_r1_distill_llama_70b | 76.2 | 31.4 | 1 | 0.41 | NaN | NaN |
| google_gemma_3_27b_it | 75.1 | 30.1 | 3 | 0.42 | 0.39 | 0.16 |
| qwen2-72b-instruct | 74.4 | 29.9 | 2 | 0.42 | 0.33 | 0.27 |
| google_gemma_3_12b_it | 74.2 | 29.5 | 3 | 0.43 | 0.38 | 0.19 |
| google_gemma_2_27b_it | 74 | 29.5 | 1 | 0.43 | NaN | NaN |
| qwen3-4b | 73.7 | 29.2 | 3 | 0.43 | 0.38 | 0.2 |
| qwen2.5-coder-32b-instruct | 73.5 | 29.2 | 2 | 0.43 | 0.36 | 0.23 |
| deepseek_r1_distill_qwen_14b | 71 | 28.9 | 3 | 0.44 | 0.31 | 0.32 |
| google_gemma_2_9b_it | 70.1 | 27.2 | 3 | 0.45 | 0.37 | 0.25 |
| deepseek_r1_distill_qwen_32b | 70.1 | 29.1 | 1 | 0.45 | NaN | NaN |
| deepseek_r1_distill_qwen_7b | 68.6 | 27.2 | 3 | 0.45 | 0.32 | 0.32 |
| qwen2.5-coder-14b-instruct | 66.8 | 25.5 | 3 | 0.46 | 0.33 | 0.32 |
| qwen2-math-72b-instruct | 66.3 | 25.7 | 2 | 0.46 | 0.32 | 0.33 |
| qwen1.5-72b-chat | 65.7 | 25 | 2 | 0.46 | 0.35 | 0.3 |
| qwen1.5-32b-chat | 64.8 | 24.6 | 2 | 0.46 | 0.33 | 0.32 |
| google_gemma_3_4b_it | 64.1 | 23.4 | 4 | 0.47 | 0.4 | 0.24 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 63.2 | 24.1 | 2 | 0.47 | 0.31 | 0.35 |
| qwen2-math-7b-instruct | 62 | 22.9 | 3 | 0.47 | 0.35 | 0.32 |
| deepseek_r1_distill_llama_8b | 62 | 24.3 | 4 | 0.47 | 0.3 | 0.36 |
| llama-3.1-8B-instruct | 60.6 | 22.7 | 7 | 0.48 | 0.48 | 0 |
| mistralai_ministral_8b_instruct_2410 | 59.7 | 21.8 | 3 | 0.48 | 0.33 | 0.34 |
| qwen2-7b-instruct | 59.2 | 21.6 | 3 | 0.48 | 0.33 | 0.35 |
| qwen3-1.7b | 58.6 | 21 | 3 | 0.48 | 0.4 | 0.26 |
| qwen2-math-1.5b-instruct | 56.8 | 20.2 | 3 | 0.48 | 0.36 | 0.33 |
| qwen1.5-14b-chat | 53.6 | 18.7 | 3 | 0.49 | 0.35 | 0.34 |
| qwen2.5-coder-7b-instruct | 51.9 | 18.4 | 3 | 0.49 | 0.31 | 0.38 |
| llama-3.2-3B-instruct | 49.6 | 17.2 | 10 | 0.49 | 0.49 | 0 |
| deepseek_r1_distill_qwen_1.5b | 49.6 | 18.8 | 4 | 0.49 | 0.29 | 0.39 |
| mistralai_mathstral_7b_v0.1 | 47.5 | 16.4 | 3 | 0.49 | 0.3 | 0.38 |
| deepseek_v2_lite_chat | 46.7 | 15.6 | 2 | 0.49 | 0.34 | 0.35 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 44.8 | 15.3 | 3 | 0.48 | 0.31 | 0.37 |
| qwen2.5-coder-3b-instruct | 40.3 | 13.1 | 3 | 0.48 | 0.3 | 0.37 |
| qwen1.5-7b-chat | 40.2 | 13.1 | 3 | 0.48 | 0.32 | 0.36 |
| google_codegemma_1.1_7b_it | 33.5 | 10 | 4 | 0.46 | 0.31 | 0.34 |
| mistralai_mistral_7b_instruct_v0.3 | 32.4 | 10.4 | 3 | 0.46 | 0.28 | 0.36 |
| google_gemma_3_1b_it | 28.9 | 8.36 | 4 | 0.44 | 0.34 | 0.28 |
| qwen3-0.6b | 28 | 8.4 | 4 | 0.44 | 0.3 | 0.32 |
| mistralai_mistral_7b_instruct_v0.2 | 27.9 | 8.92 | 3 | 0.44 | 0.28 | 0.34 |
| qwen2.5-coder-1.5b-instruct | 26.5 | 8.12 | 3 | 0.43 | 0.25 | 0.35 |
| llama-3.2-1B-instruct | 19 | 5.62 | 12 | 0.38 | 0.38 | 0 |
| google_gemma_7b_it | 18.5 | 5.3 | 4 | 0.38 | 0.26 | 0.28 |
| mistralai_mistral_7b_instruct_v0.1 | 16 | 5.07 | 3 | 0.36 | 0.18 | 0.31 |
| qwen2-1.5b-instruct | 15.2 | 5.14 | 3 | 0.35 | 0.16 | 0.31 |
| qwen1.5-1.8b-chat | 11.6 | 4.81 | 3 | 0.31 | 0.16 | 0.27 |
| qwen2.5-coder-0.5b-instruct | 7.1 | 2.39 | 4 | 0.25 | 0.11 | 0.23 |
| qwen2-0.5b-instruct | 6.61 | 2.56 | 4 | 0.24 | 0.087 | 0.23 |
| google_gemma_2b_it | 6.2 | 1.81 | 4 | 0.23 | 0.14 | 0.19 |
| qwen1.5-0.5b-chat | 5.41 | 3.2 | 4 | 0.22 | 0.11 | 0.19 |
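The three SE columns appear to satisfy SE(A)² ≈ SE_x(A)² + SE_pred(A)², i.e. a split of the total variance into a question-sampling component and a prediction-sampling component; that decomposition is inferred from the numbers, not stated on this page. Rows with count = 1 report NaN for the components, presumably because a single run cannot separate them, and the Llama rows show SE_pred(A) = 0, consistent with deterministic decoding. A quick check of the assumed decomposition against a few rows:

```python
import math

# (SE(A), SE_x(A), SE_pred(A)) copied from the table above.
rows = {
    "qwen3-32b": (0.38, 0.33, 0.19),
    "qwen3-14b": (0.38, 0.34, 0.17),
    "google_gemma_3_27b_it": (0.42, 0.39, 0.16),
    "llama-3.1-70B-instruct": (0.41, 0.41, 0.0),
}

for model, (se_total, se_x, se_pred) in rows.items():
    recombined = math.hypot(se_x, se_pred)  # sqrt(SE_x^2 + SE_pred^2)
    print(f"{model:24s} SE(A)={se_total:.2f}  "
          f"sqrt(SE_x^2 + SE_pred^2)={recombined:.3f}")
```

The recombined values (0.381, 0.380, 0.422, 0.410) match the tabulated SE(A) to the rounding of the inputs.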