# gmat_cot: by models



## SE predicted by accuracy

*Figure: the typical standard error between pairs of models on this dataset, as a function of absolute accuracy.*
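The shape of this curve follows from the usual binomial standard error of an accuracy estimate, which shrinks as accuracy moves away from 50%. A minimal sketch; the question count `n = 92` below is inferred from the table values, not stated explicitly in the source:

```python
import math

def accuracy_se(p, n):
    """Binomial standard error of an accuracy p estimated over n questions."""
    return math.sqrt(p * (1 - p) / n)

# Example: 92% accuracy over 92 questions gives an SE of about 2.8 points,
# consistent with the SE(A) column for the top row of the results table.
print(round(100 * accuracy_se(0.92, 92), 1))  # → 2.8
```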

## CDF of question-level accuracy
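A question-level accuracy CDF can be built by sorting the per-question accuracies and stepping up by 1/n at each one. A minimal sketch with illustrative values (not taken from this dataset):

```python
def empirical_cdf(values):
    """Return (xs, F): sorted values and the fraction of values <= each x."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

# Illustrative per-question accuracies (fraction of attempts answered correctly).
per_question_acc = [0.1, 0.4, 0.4, 0.7, 0.9, 1.0]
xs, cdf = empirical_cdf(per_question_acc)
print(xs)
print([round(f, 2) for f in cdf])
```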

## Results table by model

Accuracy, win-rate, and SE columns are in percentage points.

| model | pass@1 | pass@count | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|---|
| qwen3-32b | 92 | 95.7 | 42.1 | 3 | 2.8 | 2.1 | 1.9 |
| qwen3-14b | 90.2 | 95.7 | 40.6 | 3 | 3.1 | 2.2 | 2.2 |
| qwen3-8b | 86.2 | 94.6 | 38.5 | 3 | 3.6 | 2.5 | 2.6 |
| qwen2.5-coder-32b-instruct | 85.9 | 90.2 | 37.7 | 2 | 3.6 | 2.9 | 2.2 |
| google_gemma_3_27b_it | 83.3 | 88 | 35.6 | 3 | 3.9 | 3.5 | 1.8 |
| qwen2-72b-instruct | 82.1 | 90.2 | 36.1 | 2 | 4 | 2.7 | 3 |
| deepseek_r1_distill_qwen_32b | 81.5 | 88 | 36.4 | 2 | 4 | 3 | 2.7 |
| qwen3-4b | 81.2 | 92.4 | 35 | 4 | 4.1 | 3 | 2.8 |
| deepseek_r1_distill_llama_70b | 81 | 84.8 | 35.3 | 2 | 4.1 | 3.6 | 2 |
| google_gemma_3_12b_it | 78.8 | 89.1 | 33.4 | 4 | 4.3 | 3.4 | 2.5 |
| deepseek_r1_distill_qwen_14b | 78.5 | 87 | 34 | 4 | 4.3 | 3.5 | 2.5 |
| google_gemma_2_27b_it | 78.3 | 85.9 | 33.3 | 2 | 4.3 | 3.2 | 2.9 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 77.5 | 91.3 | 33.7 | 3 | 4.4 | 2.7 | 3.4 |
| qwen2.5-coder-14b-instruct | 76.1 | 89.1 | 32 | 3 | 4.4 | 3 | 3.3 |
| google_gemma_2_9b_it | 72.3 | 85.9 | 30.2 | 4 | 4.7 | 3.6 | 3 |
| qwen1.5-72b-chat | 66.8 | 80.4 | 28.2 | 2 | 4.9 | 3.1 | 3.8 |
| qwen1.5-32b-chat | 66.3 | 84.8 | 27.2 | 3 | 4.9 | 3.3 | 3.7 |
| qwen2-math-72b-instruct | 64.7 | 83.7 | 28 | 2 | 5 | 2 | 4.5 |
| google_gemma_3_4b_it | 64.1 | 82.6 | 26.1 | 5 | 5 | 3.7 | 3.4 |
| qwen2.5-coder-7b-instruct | 63 | 91.3 | 26 | 4 | 5 | 3 | 4.1 |
| qwen3-1.7b | 61.7 | 84.8 | 25.2 | 4 | 5.1 | 3.3 | 3.9 |
| deepseek_r1_distill_qwen_7b | 61.7 | 78.3 | 25.7 | 4 | 5.1 | 3.8 | 3.4 |
| deepseek_r1_distill_llama_8b | 59.2 | 79.3 | 23.3 | 4 | 5.1 | 3.6 | 3.6 |
| llama-3.1-8B-instruct | 58.7 | 58.7 | 24.7 | 7 | 5.1 | 5.1 | 0 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 54.7 | 79.3 | 22.6 | 3 | 5.2 | 3.2 | 4.1 |
| qwen2-math-7b-instruct | 54.3 | 70.7 | 22.8 | 2 | 5.2 | 3 | 4.2 |
| qwen1.5-14b-chat | 53.3 | 76.1 | 21.3 | 3 | 5.2 | 3.2 | 4.1 |
| qwen2-7b-instruct | 50.8 | 81.5 | 20 | 4 | 5.2 | 2.9 | 4.3 |
| mistralai_mathstral_7b_v0.1 | 50.5 | 83.7 | 19.7 | 4 | 5.2 | 2.9 | 4.3 |
| mistralai_ministral_8b_instruct_2410 | 50.5 | 76.1 | 19.3 | 4 | 5.2 | 3.3 | 4 |
| mistralai_mistral_7b_instruct_v0.3 | 46.5 | 77.2 | 18 | 4 | 5.2 | 3.1 | 4.2 |
| qwen3-0.6b | 43.3 | 72.8 | 17.2 | 5 | 5.2 | 3.4 | 3.9 |
| qwen2.5-coder-3b-instruct | 39.7 | 76.1 | 15.7 | 4 | 5.1 | 2.4 | 4.5 |
| deepseek_v2_lite_chat | 39.5 | 62 | 15.4 | 3 | 5.1 | 3.1 | 4 |
| mistralai_mistral_7b_instruct_v0.2 | 39.4 | 64.1 | 16 | 4 | 5.1 | 3.3 | 3.9 |
| llama-3.2-3B-instruct | 38 | 38 | 16.2 | 10 | 5.1 | 5.1 | 0 |
| qwen1.5-7b-chat | 38 | 63 | 15.1 | 3 | 5.1 | 2.8 | 4.2 |
| google_gemma_7b_it | 35.9 | 63 | 14.1 | 4 | 5 | 3.2 | 3.9 |
| google_codegemma_1.1_7b_it | 33.7 | 69.6 | 12.7 | 5 | 4.9 | 2.6 | 4.2 |
| google_gemma_2b_it | 32.1 | 40.2 | 11.8 | 4 | 4.9 | 4.3 | 2.3 |
| mistralai_mistral_7b_instruct_v0.1 | 30.7 | 68.5 | 12.5 | 4 | 4.8 | 1.8 | 4.5 |
| google_gemma_3_1b_it | 29.9 | 58.7 | 11.7 | 4 | 4.8 | 2.8 | 3.8 |
| deepseek_r1_distill_qwen_1.5b | 26.4 | 52.2 | 10.2 | 4 | 4.6 | 2.5 | 3.8 |
| qwen2-1.5b-instruct | 26.4 | 59.8 | 10.8 | 4 | 4.6 | 1.9 | 4.2 |
| qwen2.5-coder-1.5b-instruct | 25.5 | 62 | 10.6 | 4 | 4.5 | 1.7 | 4.2 |
| llama-3.2-1B-instruct | 20.7 | 20.7 | 7.9 | 13 | 4.2 | 4.2 | 0 |
| qwen2.5-coder-0.5b-instruct | 12.6 | 47.8 | 5.66 | 5 | 3.5 | 0.53 | 3.4 |
| qwen2-0.5b-instruct | 12.2 | 46.7 | 5.53 | 5 | 3.4 | 0.53 | 3.4 |
| qwen1.5-1.8b-chat | 12 | 31.5 | 5.1 | 3 | 3.4 | 0.15 | 3.4 |
| qwen2-math-1.5b-instruct | 6.52 | 16.3 | 2.87 | 3 | 2.6 | 1.1 | 2.3 |
| qwen1.5-0.5b-chat | 5.65 | 25 | 2.64 | 5 | 2.4 | 0.085 | 2.4 |
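The three SE columns appear consistent with a variance decomposition, SE(A)² ≈ SE_x(A)² + SE_pred(A)², splitting total uncertainty into a question-sampling term and a resampling term. This relation is inferred from the table's numbers, not stated in the source; a quick check on a few rows:

```python
import math

# Rows copied from the table above: (model, SE(A), SE_x(A), SE_pred(A)).
rows = [
    ("qwen3-32b", 2.8, 2.1, 1.9),
    ("qwen3-14b", 3.1, 2.2, 2.2),
    ("llama-3.1-8B-instruct", 5.1, 5.1, 0.0),
]
for name, se, se_x, se_pred in rows:
    combined = math.hypot(se_x, se_pred)  # sqrt(SE_x^2 + SE_pred^2)
    print(f"{name}: SE(A)={se}, combined={combined:.1f}")
```

For rows where pass@count equals pass@1 (identical answers on every resample), SE_pred(A) is 0 and SE(A) reduces to SE_x(A), which matches the llama rows above.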