There are 274 examples not solved by any model.
Provided these are well-posed problems, solving even some of them is a good signal that your model is genuinely better than the leading models.
104, 107, 108, 121, 122, 131, 142, 15, 157, 172, 173, 174, 181, 182, 183, 184, 195, 196, 197, 199, 200, 201, 202, 203, 204, 208, 209, 210, 211, 220, 221, 222, 225, 242, 244, 245, 269, 270, 272, 281, 282, 283, 284, 285, 286, 29, 328, 339, 362, 371, 372, 375, 379, 380, 385, 389, 39, 390, 394, 40, 408, 410, 42, 420, 421, 43, 44, 445, 45, 455, 46, 47, 475, 48, 486, 487, 488, 49, 509, 516, 520, 521, 585, 59, 6, 60, 612, 65, 666, 667, 668, 669, 67, 670, 671, 672, 673, 674, 675, 676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 726, 727, 73, 74, 745, 747, 75, 763, 764, 766, 775, 776, 779, 780, 789, 790, 794, 795, 796, 798, 808, 81, 810, 817, 818, 819, 820, 821, 822, 823, 824, 825, 826, 827, 831, 832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857, 858, 859, 86, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870, 871, 872, 873, 874, 875, 876, 877, 878, 879, 88, 880, 881, 882, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896, 897, 898, 899, 9, 900, 901, 902, 903, 904, 905, 906, 907, 908, 909, 910, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920, 921, 922, 923, 924, 925, 926, 927, 928, 929, 930, 931, 936, 937, 96, 987, 993
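A list like the one above could be derived from per-example results roughly as follows. This is a minimal sketch, not the report's actual pipeline: the `results` mapping, the model names, and the `unsolved_examples` helper are hypothetical, and "not solved" is assumed to mean a pass@1 of exactly zero for every model.

```python
# Hypothetical results: model name -> {example id -> pass@1 on that example}.
results = {
    "model_a": {1: 0.0, 2: 0.5, 3: 0.0},
    "model_b": {1: 0.0, 2: 1.0, 3: 0.2},
}

def unsolved_examples(results):
    """Return the sorted ids of examples on which every model has pass@1 == 0."""
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)
    # An example missing from a model's results is treated as unsolved by it.
    return sorted(
        ex for ex in example_ids
        if all(per_example.get(ex, 0.0) == 0.0 for per_example in results.values())
    )

unsolved = unsolved_examples(results)
print(len(unsolved), unsolved)
```

Any example that appears in this output is one where a single success by a new model would already be a result no evaluated model achieved.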
| example_link | model | min_pass1_of_model |
|---|---|---|
| 781 | qwen2.5-coder-14b-instruct | 0.432 |
| 66 | qwen2.5-coder-14b-instruct | 0.432 |
| 750 | qwen2.5-coder-14b-instruct | 0.432 |
| 338 | qwen2.5-coder-14b-instruct | 0.432 |
| 751 | qwen2.5-coder-14b-instruct | 0.432 |
| 58 | qwen2.5-coder-14b-instruct | 0.432 |
| 996 | qwen3-14b | 0.380 |
| 134 | qwen3-14b | 0.380 |
| 54 | qwen3-14b | 0.380 |
| 7 | qwen3-14b | 0.380 |
| 279 | qwen3-14b | 0.380 |
| 216 | qwen3-14b | 0.380 |
| 280 | qwen3-14b | 0.380 |
| 388 | qwen3-14b | 0.380 |
| 809 | qwen3-14b | 0.380 |
| 164 | qwen3-14b | 0.380 |
| 113 | qwen2.5-coder-7b-instruct | 0.357 |
| 112 | google_gemma_3_12b_it | 0.328 |
| 755 | google_gemma_3_12b_it | 0.328 |
| 446 | google_gemma_3_12b_it | 0.328 |
| 158 | qwen3-8b | 0.312 |
| 812 | qwen3-8b | 0.312 |
| 407 | mistralai_ministral_8b_instruct_2410 | 0.247 |
| 161 | qwen2.5-coder-3b-instruct | 0.238 |
| 228 | qwen2.5-coder-3b-instruct | 0.238 |
| 227 | qwen2.5-coder-3b-instruct | 0.238 |
| 263 | qwen2-7b-instruct | 0.236 |
| 345 | qwen1.5-14b-chat | 0.205 |
| 635 | google_gemma_3_4b_it | 0.195 |
| 523 | mistralai_mistral_7b_instruct_v0.3 | 0.182 |
| 800 | deepseek_v2_lite_chat | 0.174 |
| 468 | deepseek_r1_distill_qwen_14b | 0.161 |
| 953 | deepseek_r1_distill_qwen_7b | 0.148 |
| 596 | qwen2.5-coder-1.5b-instruct | 0.141 |
| 159 | mistralai_mistral_7b_instruct_v0.2 | 0.136 |
| 447 | google_gemma_7b_it | 0.048 |
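One plausible reading of the table above is that `min_pass1_of_model` is the overall pass@1 of the weakest model that solves the example at least once, with rows sorted from hardest to easiest. The sketch below computes that quantity under this assumption; the `results` data, the model names, and the `weakest_solver` helper are hypothetical and not taken from the report's code.

```python
# Hypothetical results: model name -> {example id -> pass@1 on that example}.
results = {
    "strong": {1: 1.0, 2: 1.0, 3: 0.0, 4: 1.0},
    "weak": {1: 1.0, 2: 0.0, 3: 0.0, 4: 0.0},
}

def weakest_solver(results):
    """Map each solved example to (model, overall pass@1) of its weakest solver."""
    # Overall pass@1 of a model: mean of its per-example pass@1 values.
    overall = {m: sum(p.values()) / len(p) for m, p in results.items()}
    rows = {}
    for model, per_example in results.items():
        for ex, p in per_example.items():
            # Assumption: any non-zero pass@1 counts as solving the example.
            if p > 0 and (ex not in rows or overall[model] < rows[ex][1]):
                rows[ex] = (model, overall[model])
    return rows

rows = weakest_solver(results)
# Hardest examples first: those whose weakest solver is still a strong model.
for ex, (model, score) in sorted(rows.items(), key=lambda kv: -kv[1][1]):
    print(f"| {ex} | {model} | {score:.3f} |")
```

Examples that no model solves (example 3 here) do not appear in the output, which matches the report listing them separately.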
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
| example_link | pass1_of_ex | tau |
|---|---|---|
| 604 | 0.004 | -0.171 |
| 802 | 0.011 | -0.157 |
| 447 | 0.002 | -0.119 |
| 603 | 0.038 | -0.038 |
| 159 | 0.003 | -0.024 |
| 649 | 0.303 | -0.023 |
| 533 | 0.029 | -0.016 |
| 596 | 0.002 | -0.012 |
| 165 | 0.027 | -0.010 |
| 953 | 0.002 | 0.000 |
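The `tau` column is presumably a Kendall rank correlation between each model's result on one problem and the same model's overall score. The report does not say which variant is used, so the sketch below assumes tau-b (which handles ties) and returns 0.0 when the statistic is undefined; both choices, the `kendall_tau_b` helper, and the sample numbers are assumptions for illustration.

```python
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall tau-b between two equal-length sequences (O(n^2) pair scan)."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: contributes to neither factor
            elif dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt((concordant + discordant + tied_x_only)
                 * (concordant + discordant + tied_y_only))
    # Convention chosen for this sketch: report 0.0 when tau is undefined.
    return (concordant - discordant) / denom if denom else 0.0

# Hypothetical data: four models ordered by overall pass@1, and a problem
# that only the weakest of them solves.
overall_pass1 = [0.1, 0.2, 0.3, 0.4]
problem_pass1 = [1.0, 0.0, 0.0, 0.0]
tau = kendall_tau_b(overall_pass1, problem_pass1)
print(round(tau, 3))
```

A negative value, as in this toy case, means that models with higher overall scores tend to do worse on the problem, which is the pattern the table highlights.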
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.