The table below shows the typical standard errors between pairs of models on this dataset as a function of absolute accuracy.
| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| qwen3-32b | 70.0 | 36.4 | 9 | 0.42 | 0.35 | 0.23 |
| qwen3-14b | 67.3 | 34.4 | 10 | 0.43 | 0.37 | 0.21 |
| llama-3.1-70B-instruct | 63.9 | 32.3 | 12 | 0.44 | 0.44 | 0.00 |
| qwen3-8b | 62.6 | 31.0 | 10 | 0.44 | 0.38 | 0.22 |
| qwen2-72b-instruct | 62.2 | 31.0 | 4 | 0.44 | 0.35 | 0.27 |
| qwen2.5-coder-32b-instruct | 60.5 | 29.5 | 8 | 0.45 | 0.36 | 0.26 |
| google_gemma_3_12b_it | 58.7 | 28.2 | 11 | 0.45 | 0.38 | 0.23 |
| deepseek_r1_distill_llama_70b | 57.7 | 27.6 | 9 | 0.45 | 0.37 | 0.25 |
| qwen3-4b | 57.6 | 27.8 | 11 | 0.45 | 0.39 | 0.23 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 51.8 | 24.0 | 9 | 0.46 | 0.34 | 0.31 |
| deepseek_r1_distill_qwen_32b | 51.5 | 23.6 | 8 | 0.46 | 0.36 | 0.28 |
| qwen2-math-72b-instruct | 51.1 | 23.8 | 8 | 0.46 | 0.34 | 0.31 |
| deepseek_r1_distill_qwen_14b | 48.1 | 21.5 | 12 | 0.46 | 0.35 | 0.29 |
| qwen1.5-72b-chat | 47.5 | 21.7 | 1 | 0.46 | NaN | NaN |
| qwen2.5-coder-14b-instruct | 47.4 | 21.3 | 10 | 0.46 | 0.33 | 0.32 |
| qwen1.5-32b-chat | 46.1 | 20.6 | 8 | 0.45 | 0.34 | 0.31 |
| llama-3.1-8B-instruct | 45.0 | 19.9 | 15 | 0.45 | 0.45 | 0.00 |
| qwen2-7b-instruct | 44.2 | 19.5 | 12 | 0.45 | 0.34 | 0.30 |
| qwen3-1.7b | 42.9 | 19.2 | 12 | 0.45 | 0.37 | 0.26 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 42.3 | 18.8 | 10 | 0.45 | 0.33 | 0.31 |
| google_gemma_3_4b_it | 41.5 | 17.9 | 13 | 0.45 | 0.38 | 0.25 |
| mistralai_ministral_8b_instruct_2410 | 39.2 | 16.7 | 11 | 0.45 | 0.31 | 0.32 |
| qwen1.5-14b-chat | 37.9 | 16.0 | 10 | 0.44 | 0.33 | 0.29 |
| qwen2.5-coder-7b-instruct | 37.7 | 16.0 | 12 | 0.44 | 0.30 | 0.32 |
| mistralai_mathstral_7b_v0.1 | 36.7 | 15.7 | 12 | 0.44 | 0.29 | 0.33 |
| deepseek_r1_distill_llama_8b | 36.1 | 14.7 | 12 | 0.44 | 0.32 | 0.30 |
| deepseek_r1_distill_qwen_7b | 36.1 | 15.1 | 12 | 0.44 | 0.33 | 0.29 |
| llama-3.2-3B-instruct | 35.0 | 15.0 | 19 | 0.43 | 0.43 | 0.00 |
| mistralai_mistral_7b_instruct_v0.3 | 33.7 | 14.3 | 12 | 0.43 | 0.31 | 0.30 |
| qwen2-math-7b-instruct | 33.1 | 14.6 | 12 | 0.43 | 0.30 | 0.31 |
| qwen2.5-coder-3b-instruct | 29.4 | 12.1 | 12 | 0.42 | 0.27 | 0.31 |
| mistralai_mistral_7b_instruct_v0.2 | 29.1 | 11.9 | 12 | 0.41 | 0.30 | 0.28 |
| deepseek_v2_lite_chat | 29.0 | 11.9 | 10 | 0.41 | 0.28 | 0.30 |
| qwen1.5-7b-chat | 25.1 | 10.1 | 10 | 0.40 | 0.26 | 0.30 |
| qwen2-math-1.5b-instruct | 25.1 | 11.5 | 12 | 0.40 | 0.25 | 0.31 |
| qwen3-0.6b | 23.8 | 10.7 | 13 | 0.39 | 0.29 | 0.26 |
| mistralai_mistral_7b_instruct_v0.1 | 23.8 | 9.96 | 12 | 0.39 | 0.25 | 0.30 |
| llama-3.2-1B-instruct | 21.5 | 9.44 | 21 | 0.37 | 0.37 | 0.00 |
| deepseek_r1_distill_qwen_1.5b | 20.5 | 8.21 | 12 | 0.37 | 0.22 | 0.29 |
| qwen2.5-coder-1.5b-instruct | 20.3 | 8.48 | 12 | 0.37 | 0.22 | 0.29 |
| qwen2-1.5b-instruct | 17.2 | 7.59 | 12 | 0.34 | 0.18 | 0.29 |
| qwen1.5-1.8b-chat | 12.4 | 5.71 | 10 | 0.30 | 0.16 | 0.25 |
| qwen2-0.5b-instruct | 11.7 | 6.16 | 13 | 0.29 | 0.13 | 0.26 |
| qwen2.5-coder-0.5b-instruct | 10.4 | 5.79 | 13 | 0.28 | 0.10 | 0.26 |
| qwen1.5-0.5b-chat | 10.3 | 5.83 | 13 | 0.28 | 0.098 | 0.26 |
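The three SE columns appear to combine in quadrature: SE(A)² ≈ SE_x(A)² + SE_pred(A)², consistent with a variance decomposition into a question-level component and a per-question sampling component (that interpretation of the components is an assumption; the relationship itself holds numerically within the table's rounding). A minimal sketch checking a few rows:

```python
import math

# Assumption: SE(A) is a total standard error whose variance splits into a
# question-level term SE_x(A) and a sampling ("prediction") term SE_pred(A).
# Values below are copied from rows of the table above.
rows = {
    "qwen3-32b":              (0.42, 0.35, 0.23),
    "llama-3.1-70B-instruct": (0.44, 0.44, 0.00),
    "qwen1.5-0.5b-chat":      (0.28, 0.098, 0.26),
}

for name, (se_total, se_x, se_pred) in rows.items():
    # Recombine the two components in quadrature.
    reconstructed = math.sqrt(se_x**2 + se_pred**2)
    # Agreement should be within the table's two-decimal rounding.
    assert abs(reconstructed - se_total) < 0.01, name
    print(f"{name}: SE(A)={se_total:.2f}, sqrt(SE_x^2 + SE_pred^2)={reconstructed:.3f}")
```

Note that the models with SE_pred(A) = 0 are exactly those where the total error is fully question-level, and the single-count row (qwen1.5-72b-chat) has NaN components, as a within-question decomposition cannot be estimated from one sample.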