mmlu_pro_cot: by examples


Not solved by any model

There are 8 examples that no model solved. If these are well-posed problems, solving some of them is a good signal that your model is indeed better than the leading models.
11801, 3265, 5029, 5528, 5529, 5724, 9319, 9551
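A list like the one above can be derived from a per-model, per-example pass table. The following is a minimal sketch, assuming a hypothetical `results` mapping of model name to `{example_id: 0 or 1}`; the model names, example ids, and data layout are illustrative and not taken from the actual evaluation code.

```python
# Hypothetical layout: results[model][example_id] = 1 if solved, else 0.
results = {
    "model_a": {101: 1, 102: 0, 103: 0},
    "model_b": {101: 0, 102: 0, 103: 1},
}


def unsolved_examples(results):
    """Return the sorted ids of examples that no model solved."""
    all_ids = set()
    for per_example in results.values():
        all_ids.update(per_example)
    return sorted(
        ex
        for ex in all_ids
        # An example counts as unsolved if every model scored 0 (or lacks it).
        if not any(per_example.get(ex, 0) for per_example in results.values())
    )


print(unsolved_examples(results))  # [102]
```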

Problems solved by 1 model only

example_link	model	min_pass1_of_model
5706	qwen2-math-72b-instruct	0.511
11637	qwen2.5-coder-7b-instruct	0.377
6355	qwen2-math-7b-instruct	0.331
1325	qwen2-math-1.5b-instruct	0.251
6663	qwen2-math-1.5b-instruct	0.251
5324	qwen3-0.6b	0.238
1402	qwen3-0.6b	0.238
6232	mistralai_mistral_7b_instruct_v0.1	0.238
9571	mistralai_mistral_7b_instruct_v0.1	0.238
5367	deepseek_r1_distill_qwen_1.5b	0.205
1468	deepseek_r1_distill_qwen_1.5b	0.205
1789	qwen2.5-coder-1.5b-instruct	0.203
9649	qwen2-1.5b-instruct	0.172
7254	qwen2-1.5b-instruct	0.172
11641	qwen1.5-1.8b-chat	0.124
11422	qwen2.5-coder-0.5b-instruct	0.104
3282	qwen2.5-coder-0.5b-instruct	0.104
3519	qwen2.5-coder-0.5b-instruct	0.104
8851	qwen2.5-coder-0.5b-instruct	0.104
5081	qwen2.5-coder-0.5b-instruct	0.104
2677	qwen1.5-0.5b-chat	0.103
4202	qwen1.5-0.5b-chat	0.103
7386	qwen1.5-0.5b-chat	0.103
8042	qwen1.5-0.5b-chat	0.103
9880	qwen1.5-0.5b-chat	0.103
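
The table above could be produced along these lines. This is a sketch under stated assumptions: `results` is a hypothetical mapping of model name to `{example_id: 0 or 1}`, and `min_pass1_of_model` is read as the overall pass@1 of the single model that solved the example. That reading of the column is a guess, not something the page states.

```python
# Hypothetical layout: results[model][example_id] = 1 if solved, else 0.
results = {
    "model_a": {1: 1, 2: 1, 3: 0, 4: 1},
    "model_b": {1: 1, 2: 0, 3: 1, 4: 0},
}


def solved_by_one_model(results):
    """Return (example_id, model, overall pass@1 of that model) rows."""
    # Overall pass@1 per model: fraction of examples it solved.
    overall = {m: sum(r.values()) / len(r) for m, r in results.items()}

    # Collect the models that solved each example.
    solvers = {}
    for model, per_example in results.items():
        for ex, ok in per_example.items():
            if ok:
                solvers.setdefault(ex, []).append(model)

    rows = [
        (ex, models[0], overall[models[0]])
        for ex, models in solvers.items()
        if len(models) == 1
    ]
    # Strongest solver first, then by example id for a stable order.
    return sorted(rows, key=lambda row: (-row[2], row[0]))


for ex, model, pass1 in solved_by_one_model(results):
    print(f"{ex}\t{model}\t{pass1:.3f}")
```

Rows near the bottom of such a table, where only a weak model solves the problem, are natural candidates for manual review, since a lucky guess is a plausible explanation.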

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e. better models tend to do worse on these).

example_link	pass1_of_ex	tau
10838	0.096	-0.605
4881	0.087	-0.563
5495	0.158	-0.563
9882	0.080	-0.552
6377	0.045	-0.543
3730	0.083	-0.539
8526	0.090	-0.537
5889	0.069	-0.526
7595	0.027	-0.519
91	0.131	-0.517
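
To make the `tau` column concrete, here is a minimal sketch that assumes it is Kendall's tau between each model's overall score and that model's pass rate on the one example. The page does not say which variant is used; the tau-b form below (which corrects for ties) and all numbers are assumptions for illustration.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pairs)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n = len(x)
    n0 = n * (n - 1) / 2  # total number of pairs
    denom = math.sqrt((n0 - ties_x) * (n0 - ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical numbers: overall score of four models, and each model's
# pass rate on one example where stronger models do worse.
overall_scores = [0.2, 0.4, 0.6, 0.8]
pass_on_example = [0.9, 0.6, 0.3, 0.0]

print(kendall_tau_b(overall_scores, pass_on_example))  # -1.0
```

A strongly negative value means the ranking of models on this one example is the reverse of their overall ranking, which often points to a wrong answer key or an ambiguous question.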

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
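
The binning behind such a histogram can be sketched as follows, assuming per-problem accuracy means the fraction of models that solved the problem. The `results` layout, the bin count, and the data are hypothetical.

```python
# Hypothetical layout: results[model][example_id] = 1 if solved, else 0.
results = {
    "model_a": {1: 1, 2: 1, 3: 0, 4: 0},
    "model_b": {1: 1, 2: 0, 3: 0, 4: 0},
    "model_c": {1: 1, 2: 1, 3: 1, 4: 0},
}


def accuracy_histogram(results, bins=5):
    """Count problems per equal-width accuracy bin over [0, 1]."""
    outcomes = {}
    for per_example in results.values():
        for ex, ok in per_example.items():
            outcomes.setdefault(ex, []).append(ok)

    counts = [0] * bins
    for oks in outcomes.values():
        accuracy = sum(oks) / len(oks)
        # Accuracy 1.0 falls into the last bin rather than overflowing.
        counts[min(int(accuracy * bins), bins - 1)] += 1
    return counts


counts = accuracy_histogram(results)
for i, count in enumerate(counts):
    low, high = i / len(counts), (i + 1) / len(counts)
    print(f"{low:.1f}-{high:.1f}  {'#' * count}")
```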

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.