ds1000: by examples

Not solved by any model

There are 324 examples that no model solves. Provided these are sound problems, solving some of them is a good signal that your model really is better than the leading models.
104, 107, 108, 121, 122, 134, 142, 15, 151, 152, 153, 154, 157, 161, 164, 172, 174, 181, 182, 183, 184, 186, 195, 196, 197, 199, 200, 201, 202, 203, 204, 208, 209, 210, 211, 216, 220, 221, 222, 225, 226, 227, 228, 238, 242, 244, 245, 253, 257, 269, 27, 270, 28, 280, 281, 282, 283, 284, 285, 286, 287, 29, 304, 319, 328, 338, 339, 346, 347, 362, 371, 372, 375, 379, 380, 385, 387, 389, 39, 390, 394, 40, 408, 410, 411, 42, 420, 421, 423, 43, 439, 44, 445, 447, 45, 455, 46, 468, 47, 475, 48, 486, 487, 488, 49, 505, 509, 516, 520, 521, 523, 56, 57, 58, 59, 596, 6, 60, 604, 612, 626, 64, 65, 66, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675, 676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 7, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 726, 727, 73, 74, 745, 747, 749, 75, 750, 751, 755, 763, 764, 765, 766, 775, 779, 780, 781, 783, 789, 79, 790, 794, 795, 796, 798, 802, 808, 809, 81, 810, 817, 818, 819, 820, 821, 822, 823, 824, 825, 826, 827, 831, 832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857, 858, 859, 86, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870, 871, 872, 873, 874, 875, 876, 877, 878, 879, 88, 880, 881, 882, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896, 897, 898, 899, 9, 900, 901, 902, 903, 904, 905, 906, 907, 908, 909, 910, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920, 921, 922, 923, 924, 925, 926, 927, 928, 929, 930, 931, 936, 937, 944, 95, 953, 96, 987, 993, 996, 997
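A minimal sketch of how such a list could be derived, assuming a hypothetical `results` mapping from model name to the set of example IDs that model solved. Neither the name nor the data layout comes from the benchmark's own code. The sketch returns IDs in numeric order, whereas the list above is in string order (e.g. 15 appears after 142).

```python
def unsolved_examples(all_ids, results):
    """Return the IDs in all_ids that no model solved, sorted numerically."""
    solved = set()
    for solved_by_model in results.values():
        solved |= set(solved_by_model)
    return sorted(set(all_ids) - solved)


# Toy data, purely illustrative (not real ds1000 results).
results = {
    "model_a": {0, 1, 2},
    "model_b": {2, 3},
}
print(unsolved_examples(range(6), results))
```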

Problems solved by 1 model only

example_link model min_pass1_of_model
106 qwen2.5-coder-14b-instruct 0.386
279 qwen2.5-coder-14b-instruct 0.386
973 qwen2.5-coder-14b-instruct 0.386
743 qwen2.5-coder-14b-instruct 0.386
407 qwen2.5-coder-14b-instruct 0.386
241 qwen2.5-coder-14b-instruct 0.386
131 qwen3-14b 0.370
621 qwen3-14b 0.370
784 qwen3-14b 0.370
55 qwen3-14b 0.370
32 qwen3-14b 0.370
402 qwen3-14b 0.370
105 google_gemma_3_12b_it 0.322
187 google_gemma_3_12b_it 0.322
130 google_gemma_3_12b_it 0.322
446 google_gemma_3_12b_it 0.322
67 google_gemma_3_12b_it 0.322
159 qwen3-8b 0.300
776 qwen3-8b 0.300
662 qwen3-8b 0.300
113 qwen3-4b 0.285
515 qwen3-4b 0.285
258 qwen3-4b 0.285
87 qwen3-4b 0.285
173 qwen2.5-coder-7b-instruct 0.285
272 qwen1.5-14b-chat 0.188
80 google_codegemma_1.1_7b_it 0.186
263 google_codegemma_1.1_7b_it 0.186
772 google_codegemma_1.1_7b_it 0.186
94 google_codegemma_1.1_7b_it 0.186
264 mistralai_ministral_8b_instruct_2410 0.177
787 mistralai_ministral_8b_instruct_2410 0.177
345 mistralai_mathstral_7b_v0.1 0.168
883 mistralai_mistral_7b_instruct_v0.3 0.165
8 mistralai_mistral_7b_instruct_v0.3 0.165
812 qwen3-1.7b 0.146
603 deepseek_r1_distill_qwen_14b 0.140
582 google_gemma_2_9b_it 0.105
299 qwen2-1.5b-instruct 0.029
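The table above could be produced along the following lines. This is a sketch under assumptions: `results` (model name to set of solved example IDs) and `pass1` (model name to overall pass@1) are hypothetical inputs, and `min_pass1_of_model` is read here as the overall pass@1 of the weakest model that solves the example, which for these rows is the only model that solves it.

```python
def solved_by_one_model(results, pass1):
    """Return (example_id, model, pass1) rows for examples solved by exactly one model."""
    solvers = {}
    for model, solved in results.items():
        for ex in solved:
            solvers.setdefault(ex, []).append(model)
    rows = [
        (ex, models[0], pass1[models[0]])
        for ex, models in solvers.items()
        if len(models) == 1
    ]
    # Highest-scoring model first, as in the table; the tie-break on ID is arbitrary.
    return sorted(rows, key=lambda row: (-row[2], row[0]))


# Toy data, purely illustrative.
results = {"model_a": {1, 2}, "model_b": {2, 3}, "model_c": {4}}
pass1 = {"model_a": 0.4, "model_b": 0.3, "model_c": 0.1}
for row in solved_by_one_model(results, pass1):
    print(*row)
```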

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (i.e., better models do not reliably do better on these, and where tau is negative they tend to do worse).

example_link pass1_of_ex tau
299 0.006 -0.143
533 0.023 -0.086
582 0.009 -0.024
635 0.022 0.004
646 0.050 0.036
240 0.019 0.038
603 0.006 0.048
649 0.231 0.049
165 0.013 0.051
812 0.006 0.060
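The page does not say how tau is computed. As one plausible reading, the sketch below assumes Kendall's tau-b between each model's overall score and that model's outcome on a single problem. The function and variable names are illustrative, not taken from the benchmark's code.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences, with tie handling."""
    concordant = discordant = tied_only_x = tied_only_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: excluded from both denominator terms
            if dx == 0:
                tied_only_x += 1
            elif dy == 0:
                tied_only_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied_x = concordant + discordant + tied_only_y  # pairs not tied in x
    untied_y = concordant + discordant + tied_only_x  # pairs not tied in y
    denom = math.sqrt(untied_x * untied_y)
    return (concordant - discordant) / denom if denom else float("nan")


# Toy data: weaker models pass this problem, stronger ones fail it.
overall_scores = [0.1, 0.2, 0.3, 0.4]
passed_problem = [1, 1, 0, 0]
print(kendall_tau_b(overall_scores, passed_problem))  # negative
```

A negative value marks a problem that weaker models solve more often than stronger ones, which is why such problems are worth inspecting for faulty tests or ambiguous prompts.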

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
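A rough sketch of the binning behind such a histogram, assuming each problem's accuracy is a number in [0, 1]. The function name and the choice of ten equal-width bins are illustrative assumptions.

```python
from collections import Counter


def accuracy_histogram(per_example_accuracy, n_bins=10):
    """Count examples per equal-width accuracy bin over [0, 1]."""
    counts = Counter()
    for acc in per_example_accuracy:
        # Clamp so that an accuracy of exactly 1.0 lands in the last bin.
        idx = min(int(acc * n_bins), n_bins - 1)
        counts[idx] += 1
    return [counts[i] for i in range(n_bins)]


# Toy accuracies, purely illustrative.
print(accuracy_histogram([0.0, 0.25, 0.5, 1.0], n_bins=4))
```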

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.
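The page does not define "minimum win rate". One hedged reading is that a problem's difficulty is the lowest overall score among the models that solve it. The sketch below takes that reading, with hypothetical inputs `results` (model name to set of solved example IDs) and `score` (model name to overall score).

```python
def min_solver_score(results, score):
    """Map each solved example to the lowest overall score among its solvers."""
    difficulty = {}
    for model, solved in results.items():
        for ex in solved:
            difficulty[ex] = min(difficulty.get(ex, float("inf")), score[model])
    return difficulty


# Toy data, purely illustrative.
results = {"model_a": {1, 2}, "model_b": {2, 3}}
score = {"model_a": 0.4, "model_b": 0.3}
print(min_solver_score(results, score))
```

Examples that no model solves do not appear in the returned mapping, so they would need separate handling before plotting.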