The table below reports the typical standard errors between pairs of models on this dataset as a function of absolute accuracy.
| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| qwen3-32b | 92 | 42.1 | 3 | 2.8 | 2.1 | 1.9 |
| qwen3-14b | 90.2 | 40.6 | 3 | 3.1 | 2.2 | 2.2 |
| qwen3-8b | 86.2 | 38.5 | 3 | 3.6 | 2.5 | 2.6 |
| qwen2.5-coder-32b-instruct | 85.9 | 37.7 | 2 | 3.6 | 2.9 | 2.2 |
| google_gemma_3_27b_it | 83.3 | 35.6 | 3 | 3.9 | 3.5 | 1.8 |
| qwen2-72b-instruct | 82.1 | 36.1 | 2 | 4 | 2.7 | 3 |
| deepseek_r1_distill_qwen_32b | 81.5 | 36.4 | 2 | 4 | 3 | 2.7 |
| qwen3-4b | 81.2 | 35 | 4 | 4.1 | 3 | 2.8 |
| deepseek_r1_distill_llama_70b | 81 | 35.3 | 2 | 4.1 | 3.6 | 2 |
| google_gemma_3_12b_it | 78.8 | 33.4 | 4 | 4.3 | 3.4 | 2.5 |
| deepseek_r1_distill_qwen_14b | 78.5 | 34 | 4 | 4.3 | 3.5 | 2.5 |
| google_gemma_2_27b_it | 78.3 | 33.3 | 2 | 4.3 | 3.2 | 2.9 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 77.5 | 33.7 | 3 | 4.4 | 2.7 | 3.4 |
| qwen2.5-coder-14b-instruct | 76.1 | 32 | 3 | 4.4 | 3 | 3.3 |
| google_gemma_2_9b_it | 72.3 | 30.2 | 4 | 4.7 | 3.6 | 3 |
| qwen1.5-72b-chat | 66.8 | 28.2 | 2 | 4.9 | 3.1 | 3.8 |
| qwen1.5-32b-chat | 66.3 | 27.2 | 3 | 4.9 | 3.3 | 3.7 |
| qwen2-math-72b-instruct | 64.7 | 28 | 2 | 5 | 2 | 4.5 |
| google_gemma_3_4b_it | 64.1 | 26.1 | 5 | 5 | 3.7 | 3.4 |
| qwen2.5-coder-7b-instruct | 63 | 26 | 4 | 5 | 3 | 4.1 |
| qwen3-1.7b | 61.7 | 25.2 | 4 | 5.1 | 3.3 | 3.9 |
| deepseek_r1_distill_qwen_7b | 61.7 | 25.7 | 4 | 5.1 | 3.8 | 3.4 |
| deepseek_r1_distill_llama_8b | 59.2 | 23.3 | 4 | 5.1 | 3.6 | 3.6 |
| llama-3.1-8B-instruct | 58.7 | 24.7 | 7 | 5.1 | 5.1 | 0 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 54.7 | 22.6 | 3 | 5.2 | 3.2 | 4.1 |
| qwen2-math-7b-instruct | 54.3 | 22.8 | 2 | 5.2 | 3 | 4.2 |
| qwen1.5-14b-chat | 53.3 | 21.3 | 3 | 5.2 | 3.2 | 4.1 |
| qwen2-7b-instruct | 50.8 | 20 | 4 | 5.2 | 2.9 | 4.3 |
| mistralai_mathstral_7b_v0.1 | 50.5 | 19.7 | 4 | 5.2 | 2.9 | 4.3 |
| mistralai_ministral_8b_instruct_2410 | 50.5 | 19.3 | 4 | 5.2 | 3.3 | 4 |
| mistralai_mistral_7b_instruct_v0.3 | 46.5 | 18 | 4 | 5.2 | 3.1 | 4.2 |
| qwen3-0.6b | 43.3 | 17.2 | 5 | 5.2 | 3.4 | 3.9 |
| qwen2.5-coder-3b-instruct | 39.7 | 15.7 | 4 | 5.1 | 2.4 | 4.5 |
| deepseek_v2_lite_chat | 39.5 | 15.4 | 3 | 5.1 | 3.1 | 4 |
| mistralai_mistral_7b_instruct_v0.2 | 39.4 | 16 | 4 | 5.1 | 3.3 | 3.9 |
| llama-3.2-3B-instruct | 38 | 16.2 | 10 | 5.1 | 5.1 | 0 |
| qwen1.5-7b-chat | 38 | 15.1 | 3 | 5.1 | 2.8 | 4.2 |
| google_gemma_7b_it | 35.9 | 14.1 | 4 | 5 | 3.2 | 3.9 |
| google_codegemma_1.1_7b_it | 33.7 | 12.7 | 5 | 4.9 | 2.6 | 4.2 |
| google_gemma_2b_it | 32.1 | 11.8 | 4 | 4.9 | 4.3 | 2.3 |
| mistralai_mistral_7b_instruct_v0.1 | 30.7 | 12.5 | 4 | 4.8 | 1.8 | 4.5 |
| google_gemma_3_1b_it | 29.9 | 11.7 | 4 | 4.8 | 2.8 | 3.8 |
| deepseek_r1_distill_qwen_1.5b | 26.4 | 10.2 | 4 | 4.6 | 2.5 | 3.8 |
| qwen2-1.5b-instruct | 26.4 | 10.8 | 4 | 4.6 | 1.9 | 4.2 |
| qwen2.5-coder-1.5b-instruct | 25.5 | 10.6 | 4 | 4.5 | 1.7 | 4.2 |
| llama-3.2-1B-instruct | 20.7 | 7.9 | 13 | 4.2 | 4.2 | 0 |
| qwen2.5-coder-0.5b-instruct | 12.6 | 5.66 | 5 | 3.5 | 0.53 | 3.4 |
| qwen2-0.5b-instruct | 12.2 | 5.53 | 5 | 3.4 | 0.53 | 3.4 |
| qwen1.5-1.8b-chat | 12 | 5.1 | 3 | 3.4 | 0.15 | 3.4 |
| qwen2-math-1.5b-instruct | 6.52 | 2.87 | 3 | 2.6 | 1.1 | 2.3 |
| qwen1.5-0.5b-chat | 5.65 | 2.64 | 5 | 2.4 | 0.085 | 2.4 |
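Across the rows, the three error columns appear to satisfy SE(A)² ≈ SE_x(A)² + SE_pred(A)², i.e. a split of the total standard error into a question-level component and a per-sample prediction component. The sketch below shows one plausible way such a decomposition could be computed from per-question 0/1 scores with several sampled completions per question; the function name, the score layout, and the exact estimators are assumptions for illustration, not necessarily the procedure used to produce this table.

```python
import numpy as np

def accuracy_se_decomposition(scores: np.ndarray) -> dict:
    """Illustrative decomposition of the standard error of mean accuracy.

    `scores` is an (n_questions, k_samples) array of 0/1 outcomes, i.e.
    k independent completions scored per question.  Under the law of
    total variance, the variance of the mean accuracy splits into a
    between-question term and a within-question (sampling) term, so
    SE(A)^2 ~= SE_x(A)^2 + SE_pred(A)^2.  (Assumed layout and estimators.)
    """
    n, k = scores.shape
    per_q_mean = scores.mean(axis=1)          # per-question accuracy
    A = per_q_mean.mean()                     # overall accuracy

    # Total (clustered) variance of the mean accuracy across questions.
    var_total = per_q_mean.var(ddof=1) / n

    # Prediction component: noise from scoring only k completions per
    # question, averaged over questions.
    var_pred = (scores.var(axis=1, ddof=1) / k).mean() / n

    # Question component: remainder under the total-variance split,
    # clipped at zero in case sampling noise dominates.
    var_x = max(var_total - var_pred, 0.0)

    return {
        "accuracy": A,
        "SE(A)": float(np.sqrt(var_total)),
        "SE_x(A)": float(np.sqrt(var_x)),
        "SE_pred(A)": float(np.sqrt(var_pred)),
    }

# Example with synthetic data: 200 questions, 4 completions per question.
rng = np.random.default_rng(0)
p = rng.uniform(0.2, 0.9, size=200)                     # latent per-question difficulty
scores = (rng.random((200, 4)) < p[:, None]).astype(float)
print(accuracy_se_decomposition(scores))
```

Multiplying the returned standard errors by 100 would put them on the same percentage-point scale as the table above.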