mmlu_pro_cot: by examples



Not solved by any model

There are 8 examples not solved by any model. If these are well-posed problems, solving some of them is a good signal that your model is genuinely better than the leading models.
11801, 3265, 5029, 5528, 5529, 5724, 9319, 9551
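
A minimal sketch of how such a list could be derived. It assumes a hypothetical `results` mapping from model name to per-example pass@1 scores, and it assumes an example counts as "solved" when a model's pass@1 on it is above zero. Neither the data layout nor the threshold is stated on this page.

```python
def unsolved_examples(results):
    """Return sorted example ids that no model solves.

    Assumption: "solved" means pass@1 > 0 for that model on that example.
    """
    all_ids = set()
    for scores in results.values():
        all_ids.update(scores)
    return sorted(
        ex for ex in all_ids
        if all(scores.get(ex, 0.0) <= 0.0 for scores in results.values())
    )


# Hypothetical toy data: model name -> {example id -> pass@1}
results = {
    "model_a": {1: 1.0, 2: 0.0, 3: 0.0},
    "model_b": {1: 0.0, 2: 0.5, 3: 0.0},
}
print(unsolved_examples(results))
```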

Problems solved by 1 model only

example_link model min_pass1_of_model
5706 qwen2-math-72b-instruct 0.511
11637 qwen2.5-coder-7b-instruct 0.377
6355 qwen2-math-7b-instruct 0.331
1325 qwen2-math-1.5b-instruct 0.251
6663 qwen2-math-1.5b-instruct 0.251
5324 qwen3-0.6b 0.238
1402 qwen3-0.6b 0.238
6232 mistralai_mistral_7b_instruct_v0.1 0.238
9571 mistralai_mistral_7b_instruct_v0.1 0.238
5367 deepseek_r1_distill_qwen_1.5b 0.205
1468 deepseek_r1_distill_qwen_1.5b 0.205
1789 qwen2.5-coder-1.5b-instruct 0.203
9649 qwen2-1.5b-instruct 0.172
7254 qwen2-1.5b-instruct 0.172
11641 qwen1.5-1.8b-chat 0.124
11422 qwen2.5-coder-0.5b-instruct 0.104
3282 qwen2.5-coder-0.5b-instruct 0.104
3519 qwen2.5-coder-0.5b-instruct 0.104
8851 qwen2.5-coder-0.5b-instruct 0.104
5081 qwen2.5-coder-0.5b-instruct 0.104
2677 qwen1.5-0.5b-chat 0.103
4202 qwen1.5-0.5b-chat 0.103
7386 qwen1.5-0.5b-chat 0.103
8042 qwen1.5-0.5b-chat 0.103
9880 qwen1.5-0.5b-chat 0.103
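
The table above could be produced along the following lines. This is a sketch under assumptions: `results` (per-example pass@1 by model) and `overall` (each model's overall pass@1) are hypothetical inputs, "solved" is taken to mean pass@1 > 0, and `min_pass1_of_model` is read as the overall score of the weakest model that solves the example, which for a single solver is just that model's overall score.

```python
def solved_by_one_model(results, overall):
    """Return (example_id, model, overall_pass1) rows for examples with exactly one solver.

    Assumption: a model "solves" an example when its pass@1 on it is > 0.
    Rows are sorted by the solver's overall score, highest first.
    """
    example_ids = set().union(*(scores.keys() for scores in results.values()))
    rows = []
    for ex in example_ids:
        solvers = [m for m, scores in results.items() if scores.get(ex, 0.0) > 0.0]
        if len(solvers) == 1:
            rows.append((ex, solvers[0], overall[solvers[0]]))
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows


# Hypothetical toy data
results = {
    "strong": {10: 1.0, 11: 0.0, 12: 1.0},
    "weak": {10: 1.0, 11: 0.2, 12: 0.0},
}
overall = {"strong": 0.8, "weak": 0.1}

for ex, model, score in solved_by_one_model(results, overall):
    print(ex, model, score)
```

Rows near the bottom of such a table deserve a closer look: when only a very weak model gets a problem right, a lucky guess is a plausible explanation.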

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e. better models tend to do worse on them.

example_link pass1_of_ex tau
10838 0.096 -0.605
4881 0.087 -0.563
5495 0.158 -0.563
9882 0.080 -0.552
6377 0.045 -0.543
3730 0.083 -0.539
8526 0.090 -0.537
5889 0.069 -0.526
7595 0.027 -0.519
91 0.131 -0.517
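
The page does not say how tau is computed. One plausible reading is a Kendall rank correlation, per problem, between each model's overall score and its score on that problem. The sketch below implements Kendall's tau-b in plain Python under that assumption; the function name and the toy numbers are illustrative only.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    concordant = discordant = tied_x = tied_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                tied_x += 1
            if dy == 0:
                tied_y += 1
            if dx == 0 or dy == 0:
                continue  # tied pairs are neither concordant nor discordant
            if (dx > 0) == (dy > 0):
                concordant += 1
            else:
                discordant += 1
    n_pairs = n * (n - 1) // 2
    denom = math.sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    if denom == 0:
        return float("nan")  # undefined when one sequence is constant
    return (concordant - discordant) / denom


# Hypothetical toy data: four models, ordered from strongest to weakest.
overall_pass1 = [0.9, 0.7, 0.5, 0.3]
pass1_on_example = [0.0, 0.1, 0.2, 0.4]  # weaker models do better here
print(kendall_tau_b(overall_pass1, pass1_on_example))
```

In the toy data every pair of models is ordered oppositely on the problem and overall, so the correlation is as negative as it can be. A problem with this pattern is a candidate for a wrong answer key or an ambiguous question.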

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
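
For readers who want to reproduce a histogram like this from their own per-problem accuracies, here is a small sketch. The bin count, the function name, and the sample values are assumptions made for illustration, not taken from the page.

```python
def accuracy_histogram(accuracies, n_bins=10):
    """Count problems per equal-width accuracy bin over [0, 1].

    The last bin is closed on the right so that an accuracy of 1.0 is counted.
    """
    counts = [0] * n_bins
    for acc in accuracies:
        if not 0.0 <= acc <= 1.0:
            raise ValueError(f"accuracy out of range: {acc}")
        idx = min(int(acc * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical per-problem accuracies (fraction of models that solve each problem)
sample = [0.0, 0.05, 0.5, 0.95, 1.0]
counts = accuracy_histogram(sample)
for i, c in enumerate(counts):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} {'#' * c}")
```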

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.