gmat_cot: by models

SE predicted by accuracy

Typical standard errors for comparisons between pairs of models on this dataset, shown as a function of absolute accuracy.
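
As a rough illustration of why a standard error depends on accuracy, the sketch below uses the simple binomial formula sqrt(p(1-p)/N). This is an assumption, not the page's stated method: the page does not say how its paired standard errors are computed, and the question count of 92 is only inferred from the granularity of the pass@count values in the results table. The function name `binomial_se` is hypothetical.

```python
import math

# Assumption: number of questions, inferred from the table, not stated on the page.
N_QUESTIONS = 92


def binomial_se(accuracy_pct: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


# The error peaks at 50% accuracy and shrinks toward 0% and 100%.
for acc in (10.0, 30.0, 50.0, 70.0, 90.0):
    print(f"accuracy={acc:5.1f}  SE={binomial_se(acc, N_QUESTIONS):.1f}")
```

Under these assumptions the formula gives about 5.2 points at 50% accuracy and about 2.4 points at 94.2%, which is consistent with the SE(A) column of the results table.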

CDF of question-level accuracy

Results table by model

model	pass1	pass@count	win_rate	count	SE(A)	SE_x(A)	SE_pred(A)
qwen3-32b	94.2	100	40.7	11	2.4	1.8	1.6
qwen3-14b	90.9	98.9	38.2	12	3	2.4	1.8
qwen3-8b	87	100	36.3	12	3.5	2.6	2.4
qwen2.5-coder-32b-instruct	87	95.7	35.5	11	3.5	2.9	2
qwen2-72b-instruct	84.8	96.7	34.8	11	3.7	2.8	2.4
google_gemma_3_27b_it	84	92.4	33.3	12	3.8	3.4	1.8
deepseek_r1_distill_qwen_32b	83.6	92.4	34.6	11	3.9	3.1	2.3
deepseek_r1_distill_llama_70b	83.1	92.4	34.4	11	3.9	3.2	2.2
deepseek_r1_distill_qwen_14b	80.6	92.4	32.7	11	4.1	3.3	2.5
qwen3-4b	80	93.5	31.3	12	4.2	3.3	2.6
google_gemma_3_12b_it	78.5	87	30.7	11	4.3	3.8	2
qwen2.5-coder-14b-instruct	78.2	96.7	30.7	12	4.3	3	3
google_gemma_2_27b_it	75	94.6	29.4	10	4.5	3.4	2.9
qwen2-math-72b-instruct	74.6	100	29.9	11	4.5	2.7	3.7
mistralai_mixtral_8x22b_instruct_v0.1	74	93.5	29.7	11	4.6	3.3	3.2
qwen1.5-72b-chat	70.6	92.4	27.1	11	4.8	3.3	3.4
qwen1.5-32b-chat	70.3	93.5	26.8	11	4.8	3.3	3.4
google_gemma_2_9b_it	69.9	89.1	26.3	11	4.8	3.6	3.2
deepseek_r1_distill_qwen_7b	67.3	92.4	25.8	11	4.9	3.6	3.3
deepseek_r1_distill_llama_8b	66.8	92.4	25.5	13	4.9	3.4	3.5
google_gemma_3_4b_it	66.1	87	24.8	13	4.9	3.8	3.2
llama-3.1-8B-instruct	65.2	65.2	24.2	15	5	5	0
qwen2.5-coder-7b-instruct	62.5	95.7	23.3	10	5	3.2	3.9
qwen3-1.7b	60.3	90.2	22.1	12	5.1	3.6	3.7
mistralai_mathstral_7b_v0.1	58.3	91.3	21.6	11	5.1	3.3	4
qwen1.5-14b-chat	57.5	89.1	21	12	5.2	3.5	3.8
llama-3.2-3B-instruct	56.5	56.5	20.1	17	5.2	5.2	0
qwen2-math-7b-instruct	55.4	85.9	20.5	6	5.2	3.3	4
qwen2-7b-instruct	55.3	92.4	19.9	11	5.2	3.1	4.2
mistralai_mixtral_8x7b_instruct_v0.1	53.9	92.4	20.7	12	5.2	3.2	4.1
mistralai_ministral_8b_instruct_2410	53.8	95.7	19.7	11	5.2	2.9	4.3
qwen3-0.6b	47.2	84.8	17.2	13	5.2	3.7	3.7
mistralai_mistral_7b_instruct_v0.3	45.8	85.9	16.6	11	5.2	3.2	4.1
mistralai_mistral_7b_instruct_v0.2	42.7	78.3	15.6	10	5.2	3.4	3.9
qwen2.5-coder-3b-instruct	42.2	87	14.2	12	5.1	3.1	4.1
deepseek_v2_lite_chat	41.7	78.3	14.8	11	5.1	3.4	3.8
qwen1.5-7b-chat	36.9	90.2	13.3	12	5	2.9	4.1
google_codegemma_1.1_7b_it	36.2	73.9	12	13	5	3.3	3.8
google_gemma_7b_it	35.7	67.4	12.7	13	5	3.6	3.4
google_gemma_3_1b_it	34.8	78.3	11.9	13	5	3.2	3.8
mistralai_mistral_7b_instruct_v0.1	33.9	83.7	11.8	11	4.9	2.8	4
google_gemma_2b_it	33.3	43.5	10.8	13	4.9	4.6	1.8
qwen2-1.5b-instruct	28.6	72.8	10	13	4.7	2.7	3.9
deepseek_r1_distill_qwen_1.5b	28.3	81.5	10.2	13	4.7	2.3	4.1
qwen2.5-coder-1.5b-instruct	28.3	81.5	9.78	11	4.7	2.4	4.1
llama-3.2-1B-instruct	25	25	9.95	12	4.5	4.5	0
qwen1.5-1.8b-chat	19.4	73.9	7.2	11	4.1	1.9	3.7
qwen2-0.5b-instruct	15.9	76.1	6.2	13	3.8	1.2	3.6
qwen2.5-coder-0.5b-instruct	13.4	80.4	5.35	13	3.5	0.58	3.5
qwen2-math-1.5b-instruct	12.2	29.3	4.41	4	3.4	1.9	2.8
qwen1.5-0.5b-chat	9.95	59.8	4.46	13	3.1	0.79	3
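
The page does not define the three standard-error columns. One reading that fits the numbers is that SE_x(A) and SE_pred(A) are two components that combine in quadrature to give SE(A), that is, SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2. The sketch below checks that reading on a few rows copied from the table. It is an observation about the reported values, not a documented definition, and the helper name `combined_se` is hypothetical.

```python
import math

# (SE(A), SE_x(A), SE_pred(A)) copied from rows of the results table.
rows = {
    "qwen3-32b": (2.4, 1.8, 1.6),
    "qwen3-14b": (3.0, 2.4, 1.8),
    "qwen2-math-72b-instruct": (4.5, 2.7, 3.7),
    "llama-3.1-8B-instruct": (5.0, 5.0, 0.0),
    "qwen2.5-coder-0.5b-instruct": (3.5, 0.58, 3.5),
}


def combined_se(se_x: float, se_pred: float) -> float:
    """Combine two independent error components in quadrature."""
    return math.hypot(se_x, se_pred)


# Differences should be small; the table values are rounded to about 0.1.
for model, (se_total, se_x, se_pred) in rows.items():
    rebuilt = combined_se(se_x, se_pred)
    print(f"{model:30s} reported={se_total:.2f} rebuilt={rebuilt:.2f}")
```

On these rows the rebuilt value stays within about 0.1 of the reported SE(A), which is within the rounding of the table. Rows with SE_pred(A) equal to 0 are also the rows where pass1 equals pass@count.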