mmlu_pro_cot: by examples



Not solved by any model

There are 52 examples not solved by any model. If these are well-posed problems, solving some of them is a good signal that your model is genuinely stronger than the leading models.
56, 999, 1017, 1402, 1468, 2677, 2810, 2909, 3265, 3282, 3528, 3570, 4146, 4202, 4499, 4651, 4689, 4736, 4966, 5072, 5348, 5529, 5724, 5857, 5895, 5997, 6063, 6081, 6117, 6177, 6232, 6343, 6355, 6600, 6769, 6859, 6949, 7006, 7145, 8249, 8754, 8851, 8880, 9324, 9571, 9649, 9902, 9917, 11277, 11635, 11741, 11801
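One way to use this list: intersect it with the set of examples your own model answers correctly. The ID set below is copied from the list above; `my_solved` is a hypothetical stand-in for your model's correctly answered example IDs.

```python
# Examples never solved by any evaluated model (from the list above).
never_solved = {
    56, 999, 1017, 1402, 1468, 2677, 2810, 2909, 3265, 3282, 3528, 3570,
    4146, 4202, 4499, 4651, 4689, 4736, 4966, 5072, 5348, 5529, 5724, 5857,
    5895, 5997, 6063, 6081, 6117, 6177, 6232, 6343, 6355, 6600, 6769, 6859,
    6949, 7006, 7145, 8249, 8754, 8851, 8880, 9324, 9571, 9649, 9902, 9917,
    11277, 11635, 11741, 11801,
}

# Hypothetical: IDs of examples your model answered correctly.
my_solved = {56, 4321, 9902}

# Problems your model cracks that no evaluated model did.
newly_cracked = never_solved & my_solved
print(sorted(newly_cracked))  # → [56, 9902]
```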

Problems solved by only one model

example_link model min_pass1_of_model
11402 qwen3-32b 0.690
2848 qwen3-14b 0.672
4945 qwen3-14b 0.672
128 qwen3-8b 0.625
6049 llama-3.1-70B-instruct 0.603
9565 llama-3.1-70B-instruct 0.603
5833 llama-3.1-70B-instruct 0.603
5956 llama-3.1-70B-instruct 0.603
7386 llama-3.1-70B-instruct 0.603
9320 llama-3.1-70B-instruct 0.603
11749 qwen2-72b-instruct 0.601
6341 qwen2-72b-instruct 0.601
686 qwen2-72b-instruct 0.601
5426 qwen2-72b-instruct 0.601
5537 qwen2.5-coder-32b-instruct 0.593
1342 google_gemma_3_12b_it 0.584
10264 google_gemma_3_12b_it 0.584
8567 deepseek_r1_distill_qwen_14b 0.447
5029 qwen2-math-72b-instruct 0.444
4835 qwen1.5-32b-chat 0.433
9142 qwen3-1.7b 0.424
5054 qwen3-1.7b 0.424
11313 qwen3-1.7b 0.424
11533 qwen3-1.7b 0.424
6109 qwen3-1.7b 0.424
8062 google_gemma_3_4b_it 0.415
10613 qwen2-7b-instruct 0.410
4654 qwen1.5-14b-chat 0.360
8042 mistralai_ministral_8b_instruct_2410 0.328
8212 mistralai_ministral_8b_instruct_2410 0.328
5694 mistralai_ministral_8b_instruct_2410 0.328
221 qwen2.5-coder-7b-instruct 0.321
11637 qwen2.5-coder-7b-instruct 0.321
7121 qwen2.5-coder-7b-instruct 0.321
1820 mistralai_mistral_7b_instruct_v0.3 0.313
10914 llama-3.2-3B-instruct 0.292
6906 llama-3.2-3B-instruct 0.292
6329 qwen2-math-7b-instruct 0.290
3954 qwen2-math-7b-instruct 0.290
2223 qwen2-math-7b-instruct 0.290
6663 qwen2-math-7b-instruct 0.290
10879 qwen2-math-7b-instruct 0.290
4967 qwen2-math-7b-instruct 0.290
5367 mistralai_mistral_7b_instruct_v0.2 0.279
2682 mistralai_mistral_7b_instruct_v0.2 0.279
6948 deepseek_v2_lite_chat 0.261
4382 qwen2.5-coder-3b-instruct 0.259
5797 qwen1.5-7b-chat 0.231
9551 qwen3-0.6b 0.230
8432 qwen3-0.6b 0.230
9820 qwen3-0.6b 0.230
8968 qwen3-0.6b 0.230
11484 qwen3-0.6b 0.230
9208 mistralai_mistral_7b_instruct_v0.1 0.190
11205 mistralai_mistral_7b_instruct_v0.1 0.190
4183 mistralai_mistral_7b_instruct_v0.1 0.190
6100 mistralai_mistral_7b_instruct_v0.1 0.190
9502 mistralai_mistral_7b_instruct_v0.1 0.190
5568 qwen2.5-coder-1.5b-instruct 0.169
9737 qwen2.5-coder-1.5b-instruct 0.169
11688 qwen2.5-coder-1.5b-instruct 0.169
11981 qwen2.5-coder-1.5b-instruct 0.169
8672 qwen2-math-1.5b-instruct 0.166
9880 qwen2-math-1.5b-instruct 0.166
5809 qwen2-math-1.5b-instruct 0.166
6095 qwen2-math-1.5b-instruct 0.166
3519 qwen2-math-1.5b-instruct 0.166
6192 qwen2-math-1.5b-instruct 0.166
9661 qwen2-math-1.5b-instruct 0.166
11451 qwen2-math-1.5b-instruct 0.166
5994 qwen2-math-1.5b-instruct 0.166
11274 llama-3.2-1B-instruct 0.165
4485 llama-3.2-1B-instruct 0.165
9782 llama-3.2-1B-instruct 0.165
5983 deepseek_r1_distill_qwen_1.5b 0.159
5639 deepseek_r1_distill_qwen_1.5b 0.159
4950 qwen2-1.5b-instruct 0.116
11394 qwen2-1.5b-instruct 0.116
11422 qwen2-1.5b-instruct 0.116
6019 qwen2-0.5b-instruct 0.095
5031 qwen2-0.5b-instruct 0.095
5185 qwen2-0.5b-instruct 0.095
11440 qwen2-0.5b-instruct 0.095
1329 qwen2-0.5b-instruct 0.095
8138 qwen2-0.5b-instruct 0.095
1581 qwen2-0.5b-instruct 0.095
6818 qwen2-0.5b-instruct 0.095
9853 qwen2-0.5b-instruct 0.095
5944 qwen2.5-coder-0.5b-instruct 0.093
1325 qwen2.5-coder-0.5b-instruct 0.093
1303 qwen2.5-coder-0.5b-instruct 0.093
9739 qwen2.5-coder-0.5b-instruct 0.093
7036 qwen2.5-coder-0.5b-instruct 0.093
2700 qwen2.5-coder-0.5b-instruct 0.093
6363 qwen2.5-coder-0.5b-instruct 0.093
6805 qwen1.5-1.8b-chat 0.088
616 qwen1.5-1.8b-chat 0.088
13 qwen1.5-1.8b-chat 0.088
6742 qwen1.5-1.8b-chat 0.088
9832 qwen1.5-0.5b-chat 0.055
5528 qwen1.5-0.5b-chat 0.055
5865 qwen1.5-0.5b-chat 0.055
5144 qwen1.5-0.5b-chat 0.055
5349 qwen1.5-0.5b-chat 0.055
7241 qwen1.5-0.5b-chat 0.055
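The table above can be derived from a per-model results mapping by keeping only the examples that exactly one model solved. A minimal sketch (not the report's actual code) with hypothetical model names and toy data:

```python
from collections import Counter

# Hypothetical toy data: which example IDs each model solved.
solved_by = {
    "model_a": {128, 2848},
    "model_b": {2848, 4945},
    "model_c": {4945},
}

# Count how many models solved each example.
counts = Counter(ex for solved in solved_by.values() for ex in solved)

# Keep examples solved by exactly one model, and report which model it was.
unique = {ex for ex, c in counts.items() if c == 1}
for ex in sorted(unique):
    solver = next(m for m, s in solved_by.items() if ex in s)
    print(ex, solver)  # → 128 model_a
```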

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (i.e. better models tend to do worse on these).

example_link pass1_of_ex tau
5495 0.111 -0.494
3512 0.039 -0.422
9882 0.117 -0.422
5056 0.072 -0.417
956 0.024 -0.406
4691 0.094 -0.403
350 0.070 -0.401
425 0.091 -0.394
4192 0.124 -0.388
8163 0.026 -0.387
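The tau column is presumably a rank correlation (such as Kendall's tau-b) between overall model strength and each model's score on the problem; a strongly negative value flags a problem where weaker models outperform stronger ones. A minimal pure-Python sketch under that assumption, with hypothetical data:

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation for equal-length sequences (handles ties)."""
    n = len(x)
    conc = disc = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx != 0 and dy != 0:
                if dx * dy > 0:
                    conc += 1
                else:
                    disc += 1
    n0 = n * (n - 1) // 2
    return (conc - disc) / math.sqrt((n0 - ties_x) * (n0 - ties_y))

# Hypothetical data: overall pass@1 of five models vs. their score on one problem.
overall = [0.9, 0.7, 0.5, 0.3, 0.1]
on_problem = [0.0, 0.0, 0.0, 1.0, 1.0]  # only the weakest models solve it

print(round(kendall_tau_b(overall, on_problem), 3))  # → -0.775
```

A tau near -1, as in the table above, is a strong hint that the problem (or its reference answer) is suspect.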

Histogram of accuracies

Histogram of problems, binned by per-problem accuracy.

Histogram of difficulties

Histogram of problems, binned by the minimum win rate needed to solve each problem.
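A sketch of this difficulty metric, assuming it is defined as: for each problem, the minimum win rate among the models that solved it (a problem solved only by strong models is hard). Model names, win rates, and the solved sets below are hypothetical:

```python
# Hypothetical overall win rates per model.
win_rate = {"strong": 0.8, "mid": 0.5, "weak": 0.2}

# Hypothetical: which models solved each problem.
solved = {
    "p1": ["strong", "mid", "weak"],  # easy: even the weakest model solves it
    "p2": ["strong"],                 # hard: only the strongest model solves it
}

# Difficulty = minimum win rate among the models that solved the problem.
difficulty = {p: min(win_rate[m] for m in models) for p, models in solved.items()}
print(difficulty)  # → {'p1': 0.2, 'p2': 0.8}
```

Binning these per-problem difficulty values would reproduce a histogram of this shape.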