mmlu_pro_cot: by models

SE predicted by accuracy

Typical standard error of the accuracy difference between pairs of models on this dataset, as a function of absolute accuracy.
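
If each question is scored as an independent Bernoulli trial, an accuracy estimate A has standard error sqrt(A(1−A)/n), which traces exactly this kind of curve: largest near 50% accuracy and shrinking toward the extremes. A minimal sketch of that relationship, assuming the SE columns below are in percentage points and n is roughly the 12,000-question MMLU-Pro test set (both are assumptions, not stated on this page); the paired-difference SE would additionally depend on the covariance between the two models' per-question scores:

```python
import numpy as np

def binomial_se_pp(acc_pct: float, n_questions: int = 12000) -> float:
    """Standard error of an accuracy estimate, in percentage points.

    Treats each question as an independent Bernoulli trial; n_questions
    defaults to the approximate MMLU-Pro test set size (an assumption).
    """
    p = acc_pct / 100.0
    return 100.0 * np.sqrt(p * (1.0 - p) / n_questions)

# The curve peaks at 50% accuracy and shrinks toward 0% and 100%.
print(f"{binomial_se_pp(69.0):.2f}")  # ~0.42 pp, cf. qwen3-32b's SE(A) below
print(f"{binomial_se_pp(5.49):.2f}")  # ~0.21 pp, cf. qwen1.5-0.5b-chat
```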

CDF of question-level accuracy
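
A curve like this can be rebuilt from per-question pass rates: score every question on each run, average over runs, sort, and plot rank/n. A sketch with placeholder data (the `scores` array and its shape are hypothetical, not this site's data format):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical input: `scores` has shape (n_runs, n_questions), with each
# entry 1 if that run answered that question correctly. Question-level
# accuracy is the mean over runs; the CDF shows what fraction of questions
# fall at or below each accuracy level.
rng = np.random.default_rng(0)
scores = rng.random((3, 12000)) < 0.6   # placeholder Bernoulli data
q_acc = scores.mean(axis=0)             # per-question accuracy
xs = np.sort(q_acc)
cdf = np.arange(1, len(xs) + 1) / len(xs)
plt.step(xs, cdf, where="post")
plt.xlabel("question-level accuracy")
plt.ylabel("fraction of questions")
plt.show()
```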

Results table by model

| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| qwen3-32b | 69 | 39.3 | 2 | 0.42 | 0.34 | 0.24 |
| qwen3-14b | 67.2 | 37.8 | 2 | 0.43 | 0.37 | 0.22 |
| qwen3-8b | 62.5 | 34.2 | 2 | 0.44 | 0.38 | 0.23 |
| llama-3.1-70B-instruct | 60.3 | 33.2 | 3 | 0.45 | 0.45 | 0 |
| qwen2-72b-instruct | 60.1 | 32.7 | 1 | 0.45 | NaN | NaN |
| qwen2.5-coder-32b-instruct | 59.3 | 31.9 | 1 | 0.45 | NaN | NaN |
| google_gemma_3_12b_it | 58.4 | 31.2 | 3 | 0.45 | 0.38 | 0.24 |
| qwen3-4b | 57.7 | 31 | 3 | 0.45 | 0.39 | 0.23 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 47.1 | 23.9 | 2 | 0.46 | 0.31 | 0.33 |
| deepseek_r1_distill_qwen_32b | 46.6 | 23.2 | 1 | 0.45 | NaN | NaN |
| qwen1.5-72b-chat | 46.5 | 23.7 | 1 | 0.45 | NaN | NaN |
| deepseek_r1_distill_qwen_14b | 44.7 | 22 | 3 | 0.45 | 0.34 | 0.29 |
| qwen2-math-72b-instruct | 44.4 | 22.6 | 1 | 0.45 | NaN | NaN |
| qwen2.5-coder-14b-instruct | 43.3 | 21.5 | 2 | 0.45 | 0.3 | 0.34 |
| qwen1.5-32b-chat | 43.3 | 21.5 | 2 | 0.45 | 0.31 | 0.33 |
| qwen3-1.7b | 42.4 | 21.3 | 3 | 0.45 | 0.36 | 0.27 |
| google_gemma_3_4b_it | 41.5 | 20.2 | 4 | 0.45 | 0.37 | 0.26 |
| qwen2-7b-instruct | 41 | 20 | 3 | 0.45 | 0.32 | 0.32 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 38.4 | 19 | 2 | 0.44 | 0.3 | 0.33 |
| llama-3.1-8B-instruct | 37.5 | 18.1 | 7 | 0.44 | 0.44 | 0 |
| qwen1.5-14b-chat | 36 | 17.1 | 2 | 0.44 | 0.31 | 0.3 |
| mistralai_ministral_8b_instruct_2410 | 32.8 | 15.4 | 3 | 0.43 | 0.27 | 0.33 |
| deepseek_r1_distill_llama_8b | 32.4 | 14.5 | 3 | 0.43 | 0.3 | 0.3 |
| qwen2.5-coder-7b-instruct | 32.1 | 15.2 | 3 | 0.43 | 0.26 | 0.33 |
| deepseek_r1_distill_qwen_7b | 31.5 | 14.5 | 3 | 0.42 | 0.3 | 0.3 |
| mistralai_mistral_7b_instruct_v0.3 | 31.3 | 14.7 | 3 | 0.42 | 0.29 | 0.31 |
| mistralai_mathstral_7b_v0.1 | 29.7 | 14 | 3 | 0.42 | 0.24 | 0.34 |
| llama-3.2-3B-instruct | 29.2 | 13.8 | 10 | 0.41 | 0.41 | 0 |
| qwen2-math-7b-instruct | 29 | 14.3 | 3 | 0.41 | 0.25 | 0.33 |
| mistralai_mistral_7b_instruct_v0.2 | 27.9 | 12.9 | 3 | 0.41 | 0.29 | 0.29 |
| google_codegemma_1.1_7b_it | 27 | 12.7 | 1 | 0.4 | NaN | NaN |
| deepseek_v2_lite_chat | 26.1 | 12 | 2 | 0.4 | 0.25 | 0.31 |
| qwen2.5-coder-3b-instruct | 25.9 | 12 | 3 | 0.4 | 0.24 | 0.32 |
| qwen1.5-7b-chat | 23.1 | 10.5 | 3 | 0.38 | 0.23 | 0.31 |
| qwen3-0.6b | 23 | 11.6 | 4 | 0.38 | 0.26 | 0.28 |
| mistralai_mistral_7b_instruct_v0.1 | 19 | 8.87 | 3 | 0.36 | 0.2 | 0.3 |
| qwen2.5-coder-1.5b-instruct | 16.9 | 8.11 | 3 | 0.34 | 0.17 | 0.3 |
| qwen2-math-1.5b-instruct | 16.6 | 8.31 | 3 | 0.34 | 0.17 | 0.29 |
| llama-3.2-1B-instruct | 16.5 | 8.09 | 12 | 0.34 | 0.34 | 0 |
| deepseek_r1_distill_qwen_1.5b | 15.9 | 7.17 | 3 | 0.33 | 0.18 | 0.28 |
| qwen2-1.5b-instruct | 11.6 | 5.82 | 3 | 0.29 | 0.12 | 0.27 |
| qwen2-0.5b-instruct | 9.51 | 5.55 | 4 | 0.27 | 0.091 | 0.25 |
| qwen2.5-coder-0.5b-instruct | 9.26 | 5.62 | 4 | 0.26 | 0.077 | 0.25 |
| qwen1.5-1.8b-chat | 8.84 | 4.47 | 3 | 0.26 | 0.11 | 0.23 |
| qwen1.5-0.5b-chat | 5.49 | 3.24 | 4 | 0.21 | 0.042 | 0.2 |
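
The three SE columns appear to combine in quadrature: for every row with count ≥ 2, SE(A)² ≈ SE_x(A)² + SE_pred(A)², as expected if SE_x(A) and SE_pred(A) are independent variance components, and rows with count = 1 show NaN, presumably because a single run cannot separate the two components. A quick check of this reading against a few rows (a sketch against values copied from the table, not this site's own code):

```python
import math

# (model, SE(A), SE_x(A), SE_pred(A)) -- values copied from the table above
rows = [
    ("qwen3-32b",             0.42, 0.34, 0.24),
    ("google_gemma_3_12b_it", 0.45, 0.38, 0.24),
    ("llama-3.2-1B-instruct", 0.34, 0.34, 0.0),
]
for name, se, se_x, se_pred in rows:
    combined = math.hypot(se_x, se_pred)  # sqrt(SE_x^2 + SE_pred^2)
    print(f"{name}: reported SE(A) {se:.2f}, combined {combined:.2f}")
```

Agreement is within the rounding of the displayed values (some rows, e.g. mistralai_mixtral_8x22b_instruct_v0.1, differ by 0.01 for that reason). The Llama rows, with SE_pred(A) = 0 and SE_x(A) equal to SE(A), fit the same reading: all of their measured variance lies across questions.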