gsm8k_plus_cot: results by model

[Figure: SE predicted by accuracy — the typical standard error between pairs of models on this dataset, as a function of absolute accuracy.]
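For intuition, such a curve is typically driven by the binomial variance of an accuracy estimate, which peaks at 50% accuracy. Below is a minimal sketch, not this page's exact method: it assumes the SE columns are in percentage points, and the question count `n_questions` is a free parameter chosen purely for illustration (the n used by this page is not stated here).

```python
import math

def predicted_se_pp(accuracy_pct: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy estimate, in percentage points.

    Under a binomial model, SE of accuracy A (as a fraction) on n questions
    is sqrt(A * (1 - A) / n); it is largest at A = 0.5.
    """
    a = accuracy_pct / 100.0
    return 100.0 * math.sqrt(a * (1.0 - a) / n_questions)

# Illustrative only: n_questions = 10_000 is an assumption, not taken from this page.
for acc in (25.0, 50.0, 75.0):
    print(f"A = {acc:.0f}%  ->  predicted SE ~ {predicted_se_pp(acc, 10_000):.2f} pp")
```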

[Figure: CDF of question-level accuracy.]
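A sketch of how a question-level accuracy CDF can be computed, assuming per-sample correctness is recorded as a 0/1 matrix with one row per question and one column per sampled answer (the `count` column in the table below suggests roughly 7–22 samples per model). Names here are illustrative, not this page's actual code.

```python
import numpy as np

def question_level_cdf(correct: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Empirical CDF of per-question accuracy.

    correct: (n_questions, k) array of 0/1 outcomes, one row per question,
    one column per sampled answer.
    Returns (x, cdf), where cdf[i] is the fraction of questions whose
    per-question accuracy is <= x[i].
    """
    per_question = correct.mean(axis=1)        # accuracy of each question over k samples
    x = np.sort(per_question)                  # CDF support
    cdf = np.arange(1, x.size + 1) / x.size    # fraction of questions at or below x
    return x, cdf

# Illustrative usage with random data (not real results):
rng = np.random.default_rng(0)
x, cdf = question_level_cdf(rng.integers(0, 2, size=(1000, 10)))
```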

Results table by model

| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---:|---:|---:|---:|---:|---:|
| qwen3-32b | 82.1 | 31.3 | 10 | 0.37 | 0.33 | 0.18 |
| qwen3-14b | 81.2 | 30.6 | 10 | 0.38 | 0.35 | 0.15 |
| qwen3-8b | 78.9 | 29.1 | 10 | 0.4 | 0.35 | 0.18 |
| llama-3.1-70B-instruct | 78.3 | 29 | 12 | 0.4 | 0.4 | 0 |
| deepseek_r1_distill_llama_70b | 78.1 | 29.1 | 9 | 0.4 | 0.33 | 0.23 |
| qwen2-72b-instruct | 76.5 | 27.7 | 7 | 0.41 | 0.34 | 0.24 |
| google_gemma_3_27b_it | 75.1 | 26.5 | 10 | 0.42 | 0.39 | 0.15 |
| deepseek_r1_distill_qwen_32b | 75.1 | 28.6 | 10 | 0.42 | 0.31 | 0.29 |
| qwen2-math-72b-instruct | 74.9 | 26.6 | 10 | 0.42 | 0.35 | 0.24 |
| qwen2.5-coder-32b-instruct | 74.7 | 26.3 | 10 | 0.42 | 0.37 | 0.21 |
| deepseek_r1_distill_qwen_14b | 74.5 | 27.5 | 12 | 0.42 | 0.32 | 0.28 |
| google_gemma_2_27b_it | 74.3 | 26.3 | 9 | 0.43 | 0.36 | 0.22 |
| google_gemma_3_12b_it | 74.2 | 25.9 | 11 | 0.43 | 0.39 | 0.18 |
| qwen3-4b | 73.6 | 25.7 | 12 | 0.43 | 0.39 | 0.18 |
| deepseek_r1_distill_qwen_7b | 73.3 | 26.5 | 12 | 0.43 | 0.33 | 0.28 |
| qwen2.5-coder-14b-instruct | 70.9 | 24.2 | 10 | 0.44 | 0.35 | 0.27 |
| google_gemma_2_9b_it | 70.6 | 24.2 | 11 | 0.44 | 0.37 | 0.24 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 69.6 | 24.2 | 10 | 0.45 | 0.33 | 0.3 |
| qwen1.5-32b-chat | 67.1 | 22.7 | 10 | 0.46 | 0.34 | 0.3 |
| qwen1.5-72b-chat | 66.9 | 22.5 | 7 | 0.46 | 0.36 | 0.28 |
| deepseek_r1_distill_llama_8b | 66.8 | 23.5 | 12 | 0.46 | 0.32 | 0.33 |
| qwen2-math-7b-instruct | 65.5 | 21.4 | 12 | 0.46 | 0.36 | 0.28 |
| mistralai_ministral_8b_instruct_2410 | 64.6 | 21 | 11 | 0.47 | 0.36 | 0.3 |
| google_gemma_3_4b_it | 64.3 | 20.4 | 13 | 0.47 | 0.41 | 0.22 |
| qwen2-7b-instruct | 64.2 | 20.9 | 12 | 0.47 | 0.35 | 0.31 |
| llama-3.1-8B-instruct | 64.1 | 21.2 | 16 | 0.47 | 0.47 | 0 |
| qwen2-math-1.5b-instruct | 61.5 | 19.2 | 12 | 0.47 | 0.38 | 0.28 |
| qwen2.5-coder-7b-instruct | 61.4 | 19.6 | 12 | 0.47 | 0.35 | 0.32 |
| qwen3-1.7b | 58.8 | 18.4 | 12 | 0.48 | 0.41 | 0.25 |
| mistralai_mathstral_7b_v0.1 | 57.7 | 17.8 | 12 | 0.48 | 0.35 | 0.33 |
| deepseek_r1_distill_qwen_1.5b | 57 | 19.2 | 12 | 0.48 | 0.32 | 0.36 |
| qwen1.5-14b-chat | 55.7 | 16.9 | 10 | 0.48 | 0.37 | 0.32 |
| llama-3.2-3B-instruct | 55.3 | 16.9 | 18 | 0.48 | 0.48 | 0 |
| deepseek_v2_lite_chat | 49.9 | 14.5 | 10 | 0.49 | 0.36 | 0.33 |
| qwen2.5-coder-3b-instruct | 49.3 | 14.1 | 12 | 0.49 | 0.35 | 0.34 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 49.2 | 14.8 | 10 | 0.49 | 0.34 | 0.35 |
| qwen1.5-7b-chat | 43 | 12.2 | 11 | 0.48 | 0.33 | 0.35 |
| google_codegemma_1.1_7b_it | 37.5 | 9.58 | 13 | 0.47 | 0.35 | 0.32 |
| mistralai_mistral_7b_instruct_v0.3 | 36.3 | 9.99 | 12 | 0.47 | 0.31 | 0.35 |
| qwen2.5-coder-1.5b-instruct | 34.9 | 8.96 | 12 | 0.46 | 0.32 | 0.34 |
| qwen3-0.6b | 30 | 7.67 | 13 | 0.45 | 0.33 | 0.3 |
| mistralai_mistral_7b_instruct_v0.2 | 29.8 | 8.21 | 12 | 0.45 | 0.3 | 0.32 |
| google_gemma_3_1b_it | 29.6 | 7.25 | 12 | 0.44 | 0.35 | 0.27 |
| qwen2-1.5b-instruct | 25.8 | 6.56 | 12 | 0.43 | 0.26 | 0.34 |
| llama-3.2-1B-instruct | 24.3 | 5.75 | 22 | 0.42 | 0.42 | 0 |
| mistralai_mistral_7b_instruct_v0.1 | 23.7 | 5.77 | 12 | 0.41 | 0.26 | 0.32 |
| google_gemma_7b_it | 19.1 | 4.54 | 12 | 0.38 | 0.28 | 0.27 |
| qwen1.5-1.8b-chat | 15.1 | 4.3 | 11 | 0.35 | 0.19 | 0.3 |
| qwen2-0.5b-instruct | 11.6 | 2.76 | 12 | 0.31 | 0.16 | 0.27 |
| qwen2.5-coder-0.5b-instruct | 9.04 | 2.19 | 13 | 0.28 | 0.15 | 0.24 |
| google_gemma_2b_it | 6.26 | 1.45 | 12 | 0.24 | 0.15 | 0.18 |
| qwen1.5-0.5b-chat | 4.29 | 1.22 | 13 | 0.2 | 0.083 | 0.18 |
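The three SE columns appear to satisfy a quadrature decomposition, SE(A)² ≈ SE_x(A)² + SE_pred(A)², i.e. the total standard error splits into a question-sampling component and an answer-resampling component, which add in quadrature. This reading is an inference from the numbers in the table, not a definition stated on this page; a minimal check against a few rows copied from above:

```python
import math

# Values copied from the table above (SE(A), SE_x(A), SE_pred(A)).
rows = {
    "qwen3-14b":                    (0.38, 0.35, 0.15),
    "deepseek_r1_distill_qwen_32b": (0.42, 0.31, 0.29),
    "llama-3.1-70B-instruct":       (0.40, 0.40, 0.00),
}
for model, (se, se_x, se_pred) in rows.items():
    recombined = math.hypot(se_x, se_pred)  # sqrt(se_x**2 + se_pred**2)
    print(f"{model}: table SE(A) = {se:.2f}, recombined ~ {recombined:.2f}")
```

Note that rows with SE_pred(A) = 0 (the llama models above) have SE(A) = SE_x(A) exactly, consistent with this decomposition.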