# gmat_cot: by models



## SE predicted by accuracy

*Figure: the typical standard error between pairs of models on this dataset, as a function of absolute accuracy.*
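The shape of this curve follows from the usual binomial standard error of an accuracy estimate, which shrinks as accuracy moves away from 50%. A minimal sketch; the question count `n = 92` below is inferred from the table values, not stated explicitly in the source:

```python
import math

def accuracy_se(p, n):
    """Binomial standard error of an accuracy p estimated over n questions."""
    return math.sqrt(p * (1 - p) / n)

# Example: 92% accuracy over 92 questions gives an SE of about 2.8 points,
# consistent with the SE(A) column for the top row of the results table.
print(round(100 * accuracy_se(0.92, 92), 1))  # → 2.8
```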

## CDF of question-level accuracy
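A question-level accuracy CDF can be built by sorting the per-question accuracies and stepping up by 1/n at each one. A minimal sketch with illustrative values (not taken from this dataset):

```python
def empirical_cdf(values):
    """Return (xs, F): sorted values and the fraction of values <= each x."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

# Illustrative per-question accuracies (fraction of attempts answered correctly).
per_question_acc = [0.1, 0.4, 0.4, 0.7, 0.9, 1.0]
xs, cdf = empirical_cdf(per_question_acc)
print(xs)
print([round(f, 2) for f in cdf])
```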

## Results table by model

Accuracy, win-rate, and SE columns are in percentage points.

| model | pass@1 | pass@count | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|---|
| qwen3-32b | 92 | 95.7 | 42.1 | 3 | 2.8 | 2.1 | 1.9 |
| qwen3-14b | 90.2 | 95.7 | 40.6 | 3 | 3.1 | 2.2 | 2.2 |
| qwen3-8b | 86.2 | 94.6 | 38.5 | 3 | 3.6 | 2.5 | 2.6 |
| qwen2.5-coder-32b-instruct | 85.9 | 90.2 | 37.7 | 2 | 3.6 | 2.9 | 2.2 |
| google_gemma_3_27b_it | 83.3 | 88 | 35.6 | 3 | 3.9 | 3.5 | 1.8 |
| qwen2-72b-instruct | 82.1 | 90.2 | 36.1 | 2 | 4 | 2.7 | 3 |
| deepseek_r1_distill_qwen_32b | 81.5 | 88 | 36.4 | 2 | 4 | 3 | 2.7 |
| qwen3-4b | 81.2 | 92.4 | 35 | 4 | 4.1 | 3 | 2.8 |
| deepseek_r1_distill_llama_70b | 81 | 84.8 | 35.3 | 2 | 4.1 | 3.6 | 2 |
| google_gemma_3_12b_it | 78.8 | 89.1 | 33.4 | 4 | 4.3 | 3.4 | 2.5 |
| deepseek_r1_distill_qwen_14b | 78.5 | 87 | 34 | 4 | 4.3 | 3.5 | 2.5 |
| google_gemma_2_27b_it | 78.3 | 85.9 | 33.3 | 2 | 4.3 | 3.2 | 2.9 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 77.5 | 91.3 | 33.7 | 3 | 4.4 | 2.7 | 3.4 |
| qwen2.5-coder-14b-instruct | 76.1 | 89.1 | 32 | 3 | 4.4 | 3 | 3.3 |
| google_gemma_2_9b_it | 72.3 | 85.9 | 30.2 | 4 | 4.7 | 3.6 | 3 |
| qwen1.5-72b-chat | 66.8 | 80.4 | 28.2 | 2 | 4.9 | 3.1 | 3.8 |
| qwen1.5-32b-chat | 66.3 | 84.8 | 27.2 | 3 | 4.9 | 3.3 | 3.7 |
| qwen2-math-72b-instruct | 64.7 | 83.7 | 28 | 2 | 5 | 2 | 4.5 |
| google_gemma_3_4b_it | 64.1 | 82.6 | 26.1 | 5 | 5 | 3.7 | 3.4 |
| qwen2.5-coder-7b-instruct | 63 | 91.3 | 26 | 4 | 5 | 3 | 4.1 |
| qwen3-1.7b | 61.7 | 84.8 | 25.2 | 4 | 5.1 | 3.3 | 3.9 |
| deepseek_r1_distill_qwen_7b | 61.7 | 78.3 | 25.7 | 4 | 5.1 | 3.8 | 3.4 |
| deepseek_r1_distill_llama_8b | 59.2 | 79.3 | 23.3 | 4 | 5.1 | 3.6 | 3.6 |
| llama-3.1-8B-instruct | 58.7 | 58.7 | 24.7 | 7 | 5.1 | 5.1 | 0 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 54.7 | 79.3 | 22.6 | 3 | 5.2 | 3.2 | 4.1 |
| qwen2-math-7b-instruct | 54.3 | 70.7 | 22.8 | 2 | 5.2 | 3 | 4.2 |
| qwen1.5-14b-chat | 53.3 | 76.1 | 21.3 | 3 | 5.2 | 3.2 | 4.1 |
| qwen2-7b-instruct | 50.8 | 81.5 | 20 | 4 | 5.2 | 2.9 | 4.3 |
| mistralai_mathstral_7b_v0.1 | 50.5 | 83.7 | 19.7 | 4 | 5.2 | 2.9 | 4.3 |
| mistralai_ministral_8b_instruct_2410 | 50.5 | 76.1 | 19.3 | 4 | 5.2 | 3.3 | 4 |
| mistralai_mistral_7b_instruct_v0.3 | 46.5 | 77.2 | 18 | 4 | 5.2 | 3.1 | 4.2 |
| qwen3-0.6b | 43.3 | 72.8 | 17.2 | 5 | 5.2 | 3.4 | 3.9 |
| qwen2.5-coder-3b-instruct | 39.7 | 76.1 | 15.7 | 4 | 5.1 | 2.4 | 4.5 |
| deepseek_v2_lite_chat | 39.5 | 62 | 15.4 | 3 | 5.1 | 3.1 | 4 |
| mistralai_mistral_7b_instruct_v0.2 | 39.4 | 64.1 | 16 | 4 | 5.1 | 3.3 | 3.9 |
| llama-3.2-3B-instruct | 38 | 38 | 16.2 | 10 | 5.1 | 5.1 | 0 |
| qwen1.5-7b-chat | 38 | 63 | 15.1 | 3 | 5.1 | 2.8 | 4.2 |
| google_gemma_7b_it | 35.9 | 63 | 14.1 | 4 | 5 | 3.2 | 3.9 |
| google_codegemma_1.1_7b_it | 33.7 | 69.6 | 12.7 | 5 | 4.9 | 2.6 | 4.2 |
| google_gemma_2b_it | 32.1 | 40.2 | 11.8 | 4 | 4.9 | 4.3 | 2.3 |
| mistralai_mistral_7b_instruct_v0.1 | 30.7 | 68.5 | 12.5 | 4 | 4.8 | 1.8 | 4.5 |
| google_gemma_3_1b_it | 29.9 | 58.7 | 11.7 | 4 | 4.8 | 2.8 | 3.8 |
| deepseek_r1_distill_qwen_1.5b | 26.4 | 52.2 | 10.2 | 4 | 4.6 | 2.5 | 3.8 |
| qwen2-1.5b-instruct | 26.4 | 59.8 | 10.8 | 4 | 4.6 | 1.9 | 4.2 |
| qwen2.5-coder-1.5b-instruct | 25.5 | 62 | 10.6 | 4 | 4.5 | 1.7 | 4.2 |
| llama-3.2-1B-instruct | 20.7 | 20.7 | 7.9 | 13 | 4.2 | 4.2 | 0 |
| qwen2.5-coder-0.5b-instruct | 12.6 | 47.8 | 5.66 | 5 | 3.5 | 0.53 | 3.4 |
| qwen2-0.5b-instruct | 12.2 | 46.7 | 5.53 | 5 | 3.4 | 0.53 | 3.4 |
| qwen1.5-1.8b-chat | 12 | 31.5 | 5.1 | 3 | 3.4 | 0.15 | 3.4 |
| qwen2-math-1.5b-instruct | 6.52 | 16.3 | 2.87 | 3 | 2.6 | 1.1 | 2.3 |
| qwen1.5-0.5b-chat | 5.65 | 25 | 2.64 | 5 | 2.4 | 0.085 | 2.4 |
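The three SE columns appear consistent with a variance decomposition, SE(A)² ≈ SE_x(A)² + SE_pred(A)², splitting total uncertainty into a question-sampling term and a resampling term. This relation is inferred from the table's numbers, not stated in the source; a quick check on a few rows:

```python
import math

# Rows copied from the table above: (model, SE(A), SE_x(A), SE_pred(A)).
rows = [
    ("qwen3-32b", 2.8, 2.1, 1.9),
    ("qwen3-14b", 3.1, 2.2, 2.2),
    ("llama-3.1-8B-instruct", 5.1, 5.1, 0.0),
]
for name, se, se_x, se_pred in rows:
    combined = math.hypot(se_x, se_pred)  # sqrt(SE_x^2 + SE_pred^2)
    print(f"{name}: SE(A)={se}, combined={combined:.1f}")
```

For rows where pass@count equals pass@1 (identical answers on every resample), SE_pred(A) is 0 and SE(A) reduces to SE_x(A), which matches the llama rows above.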