mmlu: by examples


Not solved by any model

84 examples are not solved by any model. If these are sound problems, solving some of them is a good signal that your model is genuinely better than the leading models.
mmlu/10112, mmlu/10177, mmlu/10356, mmlu/10370, mmlu/1047, mmlu/10549, mmlu/10576, mmlu/10582, mmlu/10621, mmlu/10752, mmlu/10802, mmlu/11046, mmlu/11059, mmlu/11064, mmlu/11109, mmlu/11302, mmlu/11632, mmlu/11725, mmlu/11789, mmlu/11829, mmlu/11880, mmlu/12161, mmlu/1269, mmlu/12698, mmlu/12820, mmlu/1298, mmlu/13190, mmlu/13218, mmlu/13722, mmlu/13734, mmlu/13740, mmlu/13747, mmlu/13751, mmlu/13767, mmlu/13768, mmlu/13788, mmlu/13795, mmlu/1380, mmlu/13825, mmlu/1416, mmlu/1457, mmlu/1481, mmlu/1676, mmlu/168, mmlu/1720, mmlu/1748, mmlu/1931, mmlu/2025, mmlu/2131, mmlu/2392, mmlu/2440, mmlu/2951, mmlu/3072, mmlu/3077, mmlu/3295, mmlu/3356, mmlu/3405, mmlu/3507, mmlu/419, mmlu/4222, mmlu/4329, mmlu/4377, mmlu/4393, mmlu/4448, mmlu/479, mmlu/516, mmlu/552, mmlu/6156, mmlu/6514, mmlu/6864, mmlu/6908, mmlu/7467, mmlu/8177, mmlu/8195, mmlu/8235, mmlu/8280, mmlu/8388, mmlu/8445, mmlu/8815, mmlu/9057, mmlu/9080, mmlu/9607, mmlu/9801, mmlu/9917
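
A minimal sketch of how such a list could be derived, assuming per-model results are available as a mapping from model name to per-example correctness. The `results` layout, the function name, and the toy data are illustrative assumptions, not the page's actual pipeline.

```python
from typing import Dict, List

# Assumed layout: results[model][example_id] is True if the model
# answered that example correctly.
Results = Dict[str, Dict[str, bool]]


def unsolved_examples(results: Results) -> List[str]:
    """Return the example ids that no model answered correctly."""
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)
    return sorted(
        ex
        for ex in example_ids
        if not any(per_example.get(ex, False) for per_example in results.values())
    )


# Toy data (hypothetical models and examples).
toy = {
    "model_a": {"mmlu/1": True, "mmlu/2": False, "mmlu/3": False},
    "model_b": {"mmlu/1": False, "mmlu/2": False, "mmlu/3": True},
}
print(unsolved_examples(toy))
```

A missing entry is treated as "not solved" here; a stricter version could raise an error when a model has no result for an example.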

Problems solved by 1 model only

example_link model min_pass1_of_model
mmlu/1390 Qwen1.5-110B 0.811
mmlu/2250 Qwen1.5-110B 0.811
mmlu/2255 Qwen1.5-110B 0.811
mmlu/2416 Qwen1.5-110B 0.811
mmlu/2092 Qwen1.5-110B 0.811
mmlu/12903 Qwen1.5-110B 0.811
mmlu/10778 Qwen1.5-110B 0.811
mmlu/2211 Qwen1.5-110B 0.811
mmlu/10402 Qwen1.5-110B 0.811
mmlu/4842 Meta-Llama-3-70B 0.787
mmlu/1033 Meta-Llama-3-70B 0.787
mmlu/3990 Meta-Llama-3-70B 0.787
mmlu/1738 Meta-Llama-3-70B 0.787
mmlu/10411 Meta-Llama-3-70B 0.787
mmlu/90 Meta-Llama-3-70B 0.787
mmlu/3008 Meta-Llama-3-70B 0.787
mmlu/8503 Mixtral-8x22B-v0.1 0.776
mmlu/1398 Mixtral-8x22B-v0.1 0.776
mmlu/2508 Mixtral-8x22B-v0.1 0.776
mmlu/2464 Mixtral-8x22B-v0.1 0.776
mmlu/6273 Mixtral-8x22B-v0.1 0.776
mmlu/12040 Mixtral-8x22B-v0.1 0.776
mmlu/4315 Mixtral-8x22B-v0.1 0.776
mmlu/1403 Mixtral-8x22B-v0.1 0.776
mmlu/8175 Mixtral-8x22B-v0.1 0.776
mmlu/2140 Mixtral-8x22B-v0.1 0.776
mmlu/3214 Qwen1.5-72B 0.772
mmlu/11492 Qwen1.5-72B 0.772
mmlu/5876 Qwen1.5-72B 0.772
mmlu/6935 Qwen1.5-72B 0.772
mmlu/10705 Qwen1.5-72B 0.772
mmlu/1447 dbrx-base 0.743
mmlu/10517 dbrx-base 0.743
mmlu/965 dbrx-base 0.743
mmlu/4647 dbrx-base 0.743
mmlu/1194 dbrx-base 0.743
mmlu/2579 dbrx-base 0.743
mmlu/8862 dbrx-base 0.743
mmlu/12654 dbrx-base 0.743
mmlu/10939 Qwen1.5-32B 0.736
mmlu/4276 Qwen1.5-32B 0.736
mmlu/1260 Qwen1.5-32B 0.736
mmlu/1404 Qwen1.5-32B 0.736
mmlu/2160 Qwen1.5-32B 0.736
mmlu/3997 deepseek-llm-67b-base 0.714
mmlu/3723 deepseek-llm-67b-base 0.714
mmlu/6148 deepseek-llm-67b-base 0.714
mmlu/12674 Mixtral-8x7B-v0.1 0.703
mmlu/2734 Mixtral-8x7B-v0.1 0.703
mmlu/1431 Qwen1.5-14B 0.678
mmlu/9161 Meta-Llama-3-8B 0.653
mmlu/2668 llama2_70B 0.632
mmlu/12497 gemma-7b 0.626
mmlu/3140 Mistral-7B-v0.1 0.625
mmlu/4342 Mistral-7B-v0.1 0.625
mmlu/4755 llama_65B 0.622
mmlu/3037 llama_65B 0.622
mmlu/10736 llama_65B 0.622
mmlu/2475 llama_33B 0.570
mmlu/940 falcon-40b 0.554
mmlu/1418 falcon-40b 0.554
mmlu/3440 falcon-40b 0.554
mmlu/3299 falcon-40b 0.554
mmlu/10583 falcon-40b 0.554
mmlu/4864 falcon-40b 0.554
mmlu/72 falcon-40b 0.554
mmlu/923 falcon-40b 0.554
mmlu/978 falcon-40b 0.554
mmlu/11254 Qwen1.5-4B 0.552
mmlu/2319 deepseek-llm-7b-base 0.481
mmlu/1493 deepseek-llm-7b-base 0.481
mmlu/6835 deepseek-llm-7b-base 0.481
mmlu/6914 deepseek-llm-7b-base 0.481
mmlu/10355 deepseek-llm-7b-base 0.481
mmlu/1097 deepseek-llm-7b-base 0.481
mmlu/7889 llama2_07B 0.473
mmlu/12213 Qwen1.5-1.8B 0.456
mmlu/13785 Qwen1.5-1.8B 0.456
mmlu/11664 llama_13B 0.456
mmlu/914 deepseek-moe-16b-base 0.449
mmlu/13253 deepseek-moe-16b-base 0.449
mmlu/2597 deepseek-moe-16b-base 0.449
mmlu/4863 stablelm-3b-4e1t 0.444
mmlu/225 stablelm-3b-4e1t 0.444
mmlu/3474 stablelm-base-alpha-7b-v2 0.444
mmlu/4122 stablelm-base-alpha-7b-v2 0.444
mmlu/7 stablelm-base-alpha-7b-v2 0.444
mmlu/2129 stablelm-base-alpha-7b-v2 0.444
mmlu/13044 gemma-2b 0.410
mmlu/13841 gemma-2b 0.410
mmlu/13842 gemma-2b 0.410
mmlu/6651 gemma-2b 0.410
mmlu/11510 gemma-2b 0.410
mmlu/1841 gemma-2b 0.410
mmlu/10699 gemma-2b 0.410
mmlu/4462 Qwen1.5-0.5B 0.384
mmlu/11661 Qwen1.5-0.5B 0.384
mmlu/4484 Qwen1.5-0.5B 0.384
mmlu/11057 Qwen1.5-0.5B 0.384
mmlu/2491 Qwen1.5-0.5B 0.384
mmlu/13739 llama_07B 0.351
mmlu/8379 llama_07B 0.351
mmlu/937 llama_07B 0.351
mmlu/4559 llama_07B 0.351
mmlu/982 llama_07B 0.351
mmlu/6755 llama_07B 0.351
mmlu/12469 llama_07B 0.351
mmlu/4405 llama_07B 0.351
mmlu/1326 falcon-7b 0.272
mmlu/5654 falcon-7b 0.272
mmlu/6221 falcon-7b 0.272
mmlu/165 falcon-7b 0.272
mmlu/10368 falcon-7b 0.272
mmlu/10523 falcon-7b 0.272
mmlu/1003 pythia-2.8b-deduped 0.264
mmlu/8490 pythia-2.8b-deduped 0.264
mmlu/207 pythia-2.8b-deduped 0.264
mmlu/155 pythia-2.8b-deduped 0.264
mmlu/11520 pythia-2.8b-deduped 0.264
mmlu/4779 pythia-2.8b-deduped 0.264
mmlu/4829 pythia-2.8b-deduped 0.264
mmlu/11540 pythia-2.8b-deduped 0.264
mmlu/4761 pythia-2.8b-deduped 0.264
mmlu/2647 pythia-2.8b-deduped 0.264
mmlu/8314 pythia-2.8b-deduped 0.264
mmlu/4106 pythia-2.8b-deduped 0.264
mmlu/4176 pythia-2.8b-deduped 0.264
mmlu/3916 pythia-2.8b-deduped 0.264
mmlu/988 pythia-2.8b-deduped 0.264
mmlu/8218 pythia-2.8b-deduped 0.264
mmlu/11054 pythia-12b-deduped-v0 0.247
mmlu/8501 pythia-12b-deduped-v0 0.247
mmlu/10668 pythia-12b-deduped-v0 0.247
mmlu/6101 pythia-12b-deduped-v0 0.247
mmlu/9810 pythia-6.9b-deduped-v0 0.247
mmlu/203 pythia-6.9b-deduped-v0 0.247
mmlu/1876 pythia-6.9b-deduped-v0 0.247
mmlu/8342 pythia-6.9b-deduped-v0 0.247
mmlu/2961 pythia-6.9b-deduped-v0 0.247
mmlu/9819 pythia-6.9b-deduped-v0 0.247
mmlu/8465 pythia-6.9b-deduped-v0 0.247
mmlu/11089 pythia-1b-deduped 0.246
mmlu/1979 pythia-1b-deduped 0.246
mmlu/133 pythia-1b-deduped 0.246
mmlu/5839 pythia-1b-deduped 0.246
mmlu/4816 pythia-1b-deduped 0.246
mmlu/5420 pythia-1b-deduped 0.246
mmlu/10665 pythia-1.4b-deduped-v0 0.233
mmlu/10911 pythia-1.4b-deduped-v0 0.233
mmlu/11410 pythia-1.4b-deduped-v0 0.233
mmlu/11741 pythia-1.4b-deduped-v0 0.233
mmlu/11742 pythia-1.4b-deduped-v0 0.233
mmlu/11810 pythia-1.4b-deduped-v0 0.233
mmlu/11886 pythia-1.4b-deduped-v0 0.233
mmlu/12092 pythia-1.4b-deduped-v0 0.233
mmlu/13474 pythia-1.4b-deduped-v0 0.233
mmlu/13833 pythia-1.4b-deduped-v0 0.233
mmlu/12500 pythia-1.4b-deduped-v0 0.233
mmlu/1877 pythia-1.4b-deduped-v0 0.233
mmlu/10980 pythia-1.4b-deduped-v0 0.233
mmlu/10993 pythia-1.4b-deduped-v0 0.233
mmlu/9926 pythia-1.4b-deduped-v0 0.233
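
The table above could be produced along the following lines. This is a sketch under two assumptions: that results are stored as a model-to-example correctness mapping, and that `min_pass1_of_model` is the overall pass@1 (accuracy) of the single model that solved the example. The function name and toy data are hypothetical.

```python
from typing import Dict, List, Tuple

Results = Dict[str, Dict[str, bool]]


def solved_by_one(results: Results) -> List[Tuple[str, str, float]]:
    """Return (example_id, model, model_pass1) for examples solved by exactly one model."""
    # Overall accuracy of each model across all of its examples.
    pass1 = {m: sum(r.values()) / len(r) for m, r in results.items()}
    example_ids = sorted({ex for r in results.values() for ex in r})

    rows = []
    for ex in example_ids:
        solvers = [m for m, r in results.items() if r.get(ex, False)]
        if len(solvers) == 1:
            rows.append((ex, solvers[0], round(pass1[solvers[0]], 3)))

    # Strongest solving model first, mirroring the ordering of the table.
    rows.sort(key=lambda row: -row[2])
    return rows


# Toy data (hypothetical models and examples).
toy = {
    "strong": {"mmlu/1": True, "mmlu/2": True, "mmlu/3": False, "mmlu/4": True},
    "weak": {"mmlu/1": True, "mmlu/2": False, "mmlu/3": True, "mmlu/4": False},
}
for row in solved_by_one(toy):
    print(*row)
```

Rows where only a weak model succeeds are the ones worth inspecting by hand, since a lucky guess is a plausible explanation for them.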

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (tau is negative, i.e. better models tend to do worse on them).

example_link pass1_of_ex tau
mmlu/2534 0.194 -0.551
mmlu/805 0.306 -0.550
mmlu/24 0.333 -0.526
mmlu/1461 0.167 -0.505
mmlu/4699 0.194 -0.501
mmlu/329 0.194 -0.495
mmlu/6255 0.250 -0.493
mmlu/8429 0.139 -0.490
mmlu/1467 0.222 -0.484
mmlu/5196 0.278 -0.484
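
A rank correlation like the `tau` column can be sketched as follows. The page does not state which variant it uses; this example assumes Kendall's tau-b (which corrects for ties) between each model's overall accuracy and its 0/1 correctness on one example. The function name and toy numbers are illustrative.

```python
import math
from itertools import combinations
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b between two equal-length sequences, with tie correction."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else 0.0


# Toy data: four hypothetical models, ordered from weakest to strongest.
model_accuracy = [0.2, 0.4, 0.6, 0.8]
# Only the two weakest models solve this example, so tau is negative.
solved_example = [1, 0 + 1, 0, 0]
tau = kendall_tau_b(model_accuracy, solved_example)
print(round(tau, 3))
```

A strongly negative value flags an example that weaker models get right and stronger models get wrong, which often points to a mislabeled or ambiguous question.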

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
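
As a rough sketch, the bin counts behind such a histogram could be computed like this, assuming per-problem accuracy means the fraction of models that solved the problem. The `results` layout, the bin count, and the toy data are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List

Results = Dict[str, Dict[str, bool]]


def accuracy_histogram(results: Results, n_bins: int = 10) -> List[int]:
    """Count problems per accuracy bin, where accuracy is the share of models solving it."""
    models = list(results)
    example_ids = {ex for r in results.values() for ex in r}
    counts = Counter()
    for ex in example_ids:
        acc = sum(results[m].get(ex, False) for m in models) / len(models)
        # Clamp so that accuracy 1.0 falls into the last bin.
        counts[min(int(acc * n_bins), n_bins - 1)] += 1
    return [counts[i] for i in range(n_bins)]


# Toy data (hypothetical models and examples).
toy = {
    "model_a": {"e1": True, "e2": True, "e3": False, "e4": False},
    "model_b": {"e1": True, "e2": False, "e3": False, "e4": True},
}
print(accuracy_histogram(toy))
```

The returned list can be passed to any plotting tool as bar heights, with bin `i` covering accuracies from `i / n_bins` up to `(i + 1) / n_bins`.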

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.