gmat_cot: by models


SE predicted by accuracy

This plot shows the typical standard errors between pairs of models on this dataset as a function of absolute accuracy.
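As a rough illustration of why the standard error depends on accuracy, here is a minimal sketch that uses the plain binomial formula sqrt(p(1-p)/N). It is not necessarily how the plotted curve was produced. The question count `N_QUESTIONS = 92` is an assumption, inferred from results-table values such as 65.2 and 25 matching 60/92 and 23/92. The function name `binomial_se` is hypothetical.

```python
import math

# Assumption: the dataset has 92 questions (inferred from the table, not stated on the page).
N_QUESTIONS = 92


def binomial_se(accuracy_pct: float, n_questions: int = N_QUESTIONS) -> float:
    """Binomial standard error of an accuracy, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


if __name__ == "__main__":
    # The SE peaks near 50% accuracy and shrinks toward 0% and 100%.
    for acc in (25.0, 50.0, 65.2, 94.2):
        print(f"accuracy={acc:5.1f}  SE={binomial_se(acc):.1f}")
```

Under that assumption, the formula reproduces the SE(A) column of the results table for the rows checked (for example 94.2 gives about 2.4, and 65.2 gives about 5.0).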

CDF of question-level accuracy

Results table by model

| model | pass1 | pass@count | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|---|
| qwen3-32b | 94.2 | 100 | 40.7 | 11 | 2.4 | 1.8 | 1.6 |
| qwen3-14b | 90.9 | 98.9 | 38.2 | 12 | 3 | 2.4 | 1.8 |
| qwen3-8b | 87 | 100 | 36.3 | 12 | 3.5 | 2.6 | 2.4 |
| qwen2.5-coder-32b-instruct | 87 | 95.7 | 35.5 | 11 | 3.5 | 2.9 | 2 |
| qwen2-72b-instruct | 84.8 | 96.7 | 34.8 | 11 | 3.7 | 2.8 | 2.4 |
| google_gemma_3_27b_it | 84 | 92.4 | 33.3 | 12 | 3.8 | 3.4 | 1.8 |
| deepseek_r1_distill_qwen_32b | 83.6 | 92.4 | 34.6 | 11 | 3.9 | 3.1 | 2.3 |
| deepseek_r1_distill_llama_70b | 83.1 | 92.4 | 34.4 | 11 | 3.9 | 3.2 | 2.2 |
| deepseek_r1_distill_qwen_14b | 80.6 | 92.4 | 32.7 | 11 | 4.1 | 3.3 | 2.5 |
| qwen3-4b | 80 | 93.5 | 31.3 | 12 | 4.2 | 3.3 | 2.6 |
| google_gemma_3_12b_it | 78.5 | 87 | 30.7 | 11 | 4.3 | 3.8 | 2 |
| qwen2.5-coder-14b-instruct | 78.2 | 96.7 | 30.7 | 12 | 4.3 | 3 | 3 |
| google_gemma_2_27b_it | 75 | 94.6 | 29.4 | 10 | 4.5 | 3.4 | 2.9 |
| qwen2-math-72b-instruct | 74.6 | 100 | 29.9 | 11 | 4.5 | 2.7 | 3.7 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 74 | 93.5 | 29.7 | 11 | 4.6 | 3.3 | 3.2 |
| qwen1.5-72b-chat | 70.6 | 92.4 | 27.1 | 11 | 4.8 | 3.3 | 3.4 |
| qwen1.5-32b-chat | 70.3 | 93.5 | 26.8 | 11 | 4.8 | 3.3 | 3.4 |
| google_gemma_2_9b_it | 69.9 | 89.1 | 26.3 | 11 | 4.8 | 3.6 | 3.2 |
| deepseek_r1_distill_qwen_7b | 67.3 | 92.4 | 25.8 | 11 | 4.9 | 3.6 | 3.3 |
| deepseek_r1_distill_llama_8b | 66.8 | 92.4 | 25.5 | 13 | 4.9 | 3.4 | 3.5 |
| google_gemma_3_4b_it | 66.1 | 87 | 24.8 | 13 | 4.9 | 3.8 | 3.2 |
| llama-3.1-8B-instruct | 65.2 | 65.2 | 24.2 | 15 | 5 | 5 | 0 |
| qwen2.5-coder-7b-instruct | 62.5 | 95.7 | 23.3 | 10 | 5 | 3.2 | 3.9 |
| qwen3-1.7b | 60.3 | 90.2 | 22.1 | 12 | 5.1 | 3.6 | 3.7 |
| mistralai_mathstral_7b_v0.1 | 58.3 | 91.3 | 21.6 | 11 | 5.1 | 3.3 | 4 |
| qwen1.5-14b-chat | 57.5 | 89.1 | 21 | 12 | 5.2 | 3.5 | 3.8 |
| llama-3.2-3B-instruct | 56.5 | 56.5 | 20.1 | 17 | 5.2 | 5.2 | 0 |
| qwen2-math-7b-instruct | 55.4 | 85.9 | 20.5 | 6 | 5.2 | 3.3 | 4 |
| qwen2-7b-instruct | 55.3 | 92.4 | 19.9 | 11 | 5.2 | 3.1 | 4.2 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 53.9 | 92.4 | 20.7 | 12 | 5.2 | 3.2 | 4.1 |
| mistralai_ministral_8b_instruct_2410 | 53.8 | 95.7 | 19.7 | 11 | 5.2 | 2.9 | 4.3 |
| qwen3-0.6b | 47.2 | 84.8 | 17.2 | 13 | 5.2 | 3.7 | 3.7 |
| mistralai_mistral_7b_instruct_v0.3 | 45.8 | 85.9 | 16.6 | 11 | 5.2 | 3.2 | 4.1 |
| mistralai_mistral_7b_instruct_v0.2 | 42.7 | 78.3 | 15.6 | 10 | 5.2 | 3.4 | 3.9 |
| qwen2.5-coder-3b-instruct | 42.2 | 87 | 14.2 | 12 | 5.1 | 3.1 | 4.1 |
| deepseek_v2_lite_chat | 41.7 | 78.3 | 14.8 | 11 | 5.1 | 3.4 | 3.8 |
| qwen1.5-7b-chat | 36.9 | 90.2 | 13.3 | 12 | 5 | 2.9 | 4.1 |
| google_codegemma_1.1_7b_it | 36.2 | 73.9 | 12 | 13 | 5 | 3.3 | 3.8 |
| google_gemma_7b_it | 35.7 | 67.4 | 12.7 | 13 | 5 | 3.6 | 3.4 |
| google_gemma_3_1b_it | 34.8 | 78.3 | 11.9 | 13 | 5 | 3.2 | 3.8 |
| mistralai_mistral_7b_instruct_v0.1 | 33.9 | 83.7 | 11.8 | 11 | 4.9 | 2.8 | 4 |
| google_gemma_2b_it | 33.3 | 43.5 | 10.8 | 13 | 4.9 | 4.6 | 1.8 |
| qwen2-1.5b-instruct | 28.6 | 72.8 | 10 | 13 | 4.7 | 2.7 | 3.9 |
| deepseek_r1_distill_qwen_1.5b | 28.3 | 81.5 | 10.2 | 13 | 4.7 | 2.3 | 4.1 |
| qwen2.5-coder-1.5b-instruct | 28.3 | 81.5 | 9.78 | 11 | 4.7 | 2.4 | 4.1 |
| llama-3.2-1B-instruct | 25 | 25 | 9.95 | 12 | 4.5 | 4.5 | 0 |
| qwen1.5-1.8b-chat | 19.4 | 73.9 | 7.2 | 11 | 4.1 | 1.9 | 3.7 |
| qwen2-0.5b-instruct | 15.9 | 76.1 | 6.2 | 13 | 3.8 | 1.2 | 3.6 |
| qwen2.5-coder-0.5b-instruct | 13.4 | 80.4 | 5.35 | 13 | 3.5 | 0.58 | 3.5 |
| qwen2-math-1.5b-instruct | 12.2 | 29.3 | 4.41 | 4 | 3.4 | 1.9 | 2.8 |
| qwen1.5-0.5b-chat | 9.95 | 59.8 | 4.46 | 13 | 3.1 | 0.79 | 3 |
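
The three SE columns appear to be related: in the rows checked, SE(A) is close to sqrt(SE_x(A)^2 + SE_pred(A)^2), up to rounding. The sketch below checks that on a few rows copied from the table. The page does not state this decomposition, so reading SE_x(A) and SE_pred(A) as two independent variance components is an assumption. The helper names `combined_se` and `max_gap` are hypothetical.

```python
import math

# (model, SE(A), SE_x(A), SE_pred(A)), copied from the results table.
ROWS = [
    ("qwen3-32b", 2.4, 1.8, 1.6),
    ("qwen3-14b", 3.0, 2.4, 1.8),
    ("llama-3.1-8B-instruct", 5.0, 5.0, 0.0),
    ("qwen2-math-72b-instruct", 4.5, 2.7, 3.7),
    ("qwen1.5-0.5b-chat", 3.1, 0.79, 3.0),
]


def combined_se(se_x: float, se_pred: float) -> float:
    """Combine two components in quadrature (assumes they are independent)."""
    return math.hypot(se_x, se_pred)


def max_gap(rows) -> float:
    """Largest absolute gap between the reported SE(A) and the combined value."""
    return max(abs(se - combined_se(se_x, se_pred)) for _, se, se_x, se_pred in rows)


if __name__ == "__main__":
    for name, se, se_x, se_pred in ROWS:
        print(f"{name:26s} reported={se:4.2f} combined={combined_se(se_x, se_pred):4.2f}")
    print(f"max gap: {max_gap(ROWS):.2f}")
```

Rows with SE_pred(A) = 0 (the three llama models) have SE(A) equal to SE_x(A), which is consistent with this reading.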