mmlu_pro_cot: by examples



Not solved by any model

There are 52 examples not solved by any model. If these are well-posed problems, solving some of them is a good signal that your model is genuinely stronger than the leading models.
56, 999, 1017, 1402, 1468, 2677, 2810, 2909, 3265, 3282, 3528, 3570, 4146, 4202, 4499, 4651, 4689, 4736, 4966, 5072, 5348, 5529, 5724, 5857, 5895, 5997, 6063, 6081, 6117, 6177, 6232, 6343, 6355, 6600, 6769, 6859, 6949, 7006, 7145, 8249, 8754, 8851, 8880, 9324, 9571, 9649, 9902, 9917, 11277, 11635, 11741, 11801
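One way to use this list: intersect it with the set of examples your own model answers correctly. The ID set below is copied from the list above; `my_solved` is a hypothetical stand-in for your model's correctly answered example IDs.

```python
# Examples never solved by any evaluated model (from the list above).
never_solved = {
    56, 999, 1017, 1402, 1468, 2677, 2810, 2909, 3265, 3282, 3528, 3570,
    4146, 4202, 4499, 4651, 4689, 4736, 4966, 5072, 5348, 5529, 5724, 5857,
    5895, 5997, 6063, 6081, 6117, 6177, 6232, 6343, 6355, 6600, 6769, 6859,
    6949, 7006, 7145, 8249, 8754, 8851, 8880, 9324, 9571, 9649, 9902, 9917,
    11277, 11635, 11741, 11801,
}

# Hypothetical: IDs of examples your model answered correctly.
my_solved = {56, 4321, 9902}

# Problems your model cracks that no evaluated model did.
newly_cracked = never_solved & my_solved
print(sorted(newly_cracked))  # → [56, 9902]
```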

Problems solved by only one model

example_link model min_pass1_of_model
11402 qwen3-32b 0.690
2848 qwen3-14b 0.672
4945 qwen3-14b 0.672
128 qwen3-8b 0.625
6049 llama-3.1-70B-instruct 0.603
9565 llama-3.1-70B-instruct 0.603
5833 llama-3.1-70B-instruct 0.603
5956 llama-3.1-70B-instruct 0.603
7386 llama-3.1-70B-instruct 0.603
9320 llama-3.1-70B-instruct 0.603
11749 qwen2-72b-instruct 0.601
6341 qwen2-72b-instruct 0.601
686 qwen2-72b-instruct 0.601
5426 qwen2-72b-instruct 0.601
5537 qwen2.5-coder-32b-instruct 0.593
1342 google_gemma_3_12b_it 0.584
10264 google_gemma_3_12b_it 0.584
8567 deepseek_r1_distill_qwen_14b 0.447
5029 qwen2-math-72b-instruct 0.444
4835 qwen1.5-32b-chat 0.433
9142 qwen3-1.7b 0.424
5054 qwen3-1.7b 0.424
11313 qwen3-1.7b 0.424
11533 qwen3-1.7b 0.424
6109 qwen3-1.7b 0.424
8062 google_gemma_3_4b_it 0.415
10613 qwen2-7b-instruct 0.410
4654 qwen1.5-14b-chat 0.360
8042 mistralai_ministral_8b_instruct_2410 0.328
8212 mistralai_ministral_8b_instruct_2410 0.328
5694 mistralai_ministral_8b_instruct_2410 0.328
221 qwen2.5-coder-7b-instruct 0.321
11637 qwen2.5-coder-7b-instruct 0.321
7121 qwen2.5-coder-7b-instruct 0.321
1820 mistralai_mistral_7b_instruct_v0.3 0.313
10914 llama-3.2-3B-instruct 0.292
6906 llama-3.2-3B-instruct 0.292
6329 qwen2-math-7b-instruct 0.290
3954 qwen2-math-7b-instruct 0.290
2223 qwen2-math-7b-instruct 0.290
6663 qwen2-math-7b-instruct 0.290
10879 qwen2-math-7b-instruct 0.290
4967 qwen2-math-7b-instruct 0.290
5367 mistralai_mistral_7b_instruct_v0.2 0.279
2682 mistralai_mistral_7b_instruct_v0.2 0.279
6948 deepseek_v2_lite_chat 0.261
4382 qwen2.5-coder-3b-instruct 0.259
5797 qwen1.5-7b-chat 0.231
9551 qwen3-0.6b 0.230
8432 qwen3-0.6b 0.230
9820 qwen3-0.6b 0.230
8968 qwen3-0.6b 0.230
11484 qwen3-0.6b 0.230
9208 mistralai_mistral_7b_instruct_v0.1 0.190
11205 mistralai_mistral_7b_instruct_v0.1 0.190
4183 mistralai_mistral_7b_instruct_v0.1 0.190
6100 mistralai_mistral_7b_instruct_v0.1 0.190
9502 mistralai_mistral_7b_instruct_v0.1 0.190
5568 qwen2.5-coder-1.5b-instruct 0.169
9737 qwen2.5-coder-1.5b-instruct 0.169
11688 qwen2.5-coder-1.5b-instruct 0.169
11981 qwen2.5-coder-1.5b-instruct 0.169
8672 qwen2-math-1.5b-instruct 0.166
9880 qwen2-math-1.5b-instruct 0.166
5809 qwen2-math-1.5b-instruct 0.166
6095 qwen2-math-1.5b-instruct 0.166
3519 qwen2-math-1.5b-instruct 0.166
6192 qwen2-math-1.5b-instruct 0.166
9661 qwen2-math-1.5b-instruct 0.166
11451 qwen2-math-1.5b-instruct 0.166
5994 qwen2-math-1.5b-instruct 0.166
11274 llama-3.2-1B-instruct 0.165
4485 llama-3.2-1B-instruct 0.165
9782 llama-3.2-1B-instruct 0.165
5983 deepseek_r1_distill_qwen_1.5b 0.159
5639 deepseek_r1_distill_qwen_1.5b 0.159
4950 qwen2-1.5b-instruct 0.116
11394 qwen2-1.5b-instruct 0.116
11422 qwen2-1.5b-instruct 0.116
6019 qwen2-0.5b-instruct 0.095
5031 qwen2-0.5b-instruct 0.095
5185 qwen2-0.5b-instruct 0.095
11440 qwen2-0.5b-instruct 0.095
1329 qwen2-0.5b-instruct 0.095
8138 qwen2-0.5b-instruct 0.095
1581 qwen2-0.5b-instruct 0.095
6818 qwen2-0.5b-instruct 0.095
9853 qwen2-0.5b-instruct 0.095
5944 qwen2.5-coder-0.5b-instruct 0.093
1325 qwen2.5-coder-0.5b-instruct 0.093
1303 qwen2.5-coder-0.5b-instruct 0.093
9739 qwen2.5-coder-0.5b-instruct 0.093
7036 qwen2.5-coder-0.5b-instruct 0.093
2700 qwen2.5-coder-0.5b-instruct 0.093
6363 qwen2.5-coder-0.5b-instruct 0.093
6805 qwen1.5-1.8b-chat 0.088
616 qwen1.5-1.8b-chat 0.088
13 qwen1.5-1.8b-chat 0.088
6742 qwen1.5-1.8b-chat 0.088
9832 qwen1.5-0.5b-chat 0.055
5528 qwen1.5-0.5b-chat 0.055
5865 qwen1.5-0.5b-chat 0.055
5144 qwen1.5-0.5b-chat 0.055
5349 qwen1.5-0.5b-chat 0.055
7241 qwen1.5-0.5b-chat 0.055
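The table above can be derived from a per-model results mapping by keeping only the examples that exactly one model solved. A minimal sketch (not the report's actual code) with hypothetical model names and toy data:

```python
from collections import Counter

# Hypothetical toy data: which example IDs each model solved.
solved_by = {
    "model_a": {128, 2848},
    "model_b": {2848, 4945},
    "model_c": {4945},
}

# Count how many models solved each example.
counts = Counter(ex for solved in solved_by.values() for ex in solved)

# Keep examples solved by exactly one model, and report which model it was.
unique = {ex for ex, c in counts.items() if c == 1}
for ex in sorted(unique):
    solver = next(m for m, s in solved_by.items() if ex in s)
    print(ex, solver)  # → 128 model_a
```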

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (i.e. better models tend to do worse on these).

example_link pass1_of_ex tau
5495 0.111 -0.494
3512 0.039 -0.422
9882 0.117 -0.422
5056 0.072 -0.417
956 0.024 -0.406
4691 0.094 -0.403
350 0.070 -0.401
425 0.091 -0.394
4192 0.124 -0.388
8163 0.026 -0.387
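The tau column is presumably a rank correlation (such as Kendall's tau-b) between overall model strength and each model's score on the problem; a strongly negative value flags a problem where weaker models outperform stronger ones. A minimal pure-Python sketch under that assumption, with hypothetical data:

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation for equal-length sequences (handles ties)."""
    n = len(x)
    conc = disc = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx != 0 and dy != 0:
                if dx * dy > 0:
                    conc += 1
                else:
                    disc += 1
    n0 = n * (n - 1) // 2
    return (conc - disc) / math.sqrt((n0 - ties_x) * (n0 - ties_y))

# Hypothetical data: overall pass@1 of five models vs. their score on one problem.
overall = [0.9, 0.7, 0.5, 0.3, 0.1]
on_problem = [0.0, 0.0, 0.0, 1.0, 1.0]  # only the weakest models solve it

print(round(kendall_tau_b(overall, on_problem), 3))  # → -0.775
```

A tau near -1, as in the table above, is a strong hint that the problem (or its reference answer) is suspect.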

Histogram of accuracies

Histogram of problems, binned by per-problem accuracy.

Histogram of difficulties

Histogram of problems, binned by the minimum win rate needed to solve each problem.
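A sketch of this difficulty metric, assuming it is defined as: for each problem, the minimum win rate among the models that solved it (a problem solved only by strong models is hard). Model names, win rates, and the solved sets below are hypothetical:

```python
# Hypothetical overall win rates per model.
win_rate = {"strong": 0.8, "mid": 0.5, "weak": 0.2}

# Hypothetical: which models solved each problem.
solved = {
    "p1": ["strong", "mid", "weak"],  # easy: even the weakest model solves it
    "p2": ["strong"],                 # hard: only the strongest model solves it
}

# Difficulty = minimum win rate among the models that solved the problem.
difficulty = {p: min(win_rate[m] for m in models) for p, models in solved.items()}
print(difficulty)  # → {'p1': 0.2, 'p2': 0.8}
```

Binning these per-problem difficulty values would reproduce a histogram of this shape.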