ds1000: by examples

Not solved by any model

There are 324 examples that no model solved. If these are sound problems, solving some of them is a good signal that your model is genuinely better than the leading models.
104, 107, 108, 121, 122, 134, 142, 15, 151, 152, 153, 154, 157, 161, 164, 172, 174, 181, 182, 183, 184, 186, 195, 196, 197, 199, 200, 201, 202, 203, 204, 208, 209, 210, 211, 216, 220, 221, 222, 225, 226, 227, 228, 238, 242, 244, 245, 253, 257, 269, 27, 270, 28, 280, 281, 282, 283, 284, 285, 286, 287, 29, 304, 319, 328, 338, 339, 346, 347, 362, 371, 372, 375, 379, 380, 385, 387, 389, 39, 390, 394, 40, 408, 410, 411, 42, 420, 421, 423, 43, 439, 44, 445, 447, 45, 455, 46, 468, 47, 475, 48, 486, 487, 488, 49, 505, 509, 516, 520, 521, 523, 56, 57, 58, 59, 596, 6, 60, 604, 612, 626, 64, 65, 66, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675, 676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 7, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 726, 727, 73, 74, 745, 747, 749, 75, 750, 751, 755, 763, 764, 765, 766, 775, 779, 780, 781, 783, 789, 79, 790, 794, 795, 796, 798, 802, 808, 809, 81, 810, 817, 818, 819, 820, 821, 822, 823, 824, 825, 826, 827, 831, 832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857, 858, 859, 86, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870, 871, 872, 873, 874, 875, 876, 877, 878, 879, 88, 880, 881, 882, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896, 897, 898, 899, 9, 900, 901, 902, 903, 904, 905, 906, 907, 908, 909, 910, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920, 921, 922, 923, 924, 925, 926, 927, 928, 929, 930, 931, 936, 937, 944, 95, 953, 96, 987, 993, 996, 997
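A list like this can be derived from a per-model results table. The sketch below is only an illustration under stated assumptions: the `results` layout (model name mapped to problem ID mapped to a solved flag), the model names, and the toy data are all hypothetical and are not taken from the evaluation behind this page.

```python
def unsolved_by_all(results):
    """Return the sorted IDs of problems that no model solved.

    `results` is a hypothetical layout: {model_name: {problem_id: bool}}.
    A problem missing from a model's dict is treated as not solved.
    """
    problems = set()
    for per_problem in results.values():
        problems.update(per_problem)
    return sorted(
        p for p in problems
        if not any(per_problem.get(p, False) for per_problem in results.values())
    )


# Toy data (hypothetical): two models, four problems.
toy_results = {
    "model_a": {6: False, 7: False, 15: True, 104: False},
    "model_b": {6: False, 7: True, 15: True, 104: False},
}
unsolved = unsolved_by_all(toy_results)
print(unsolved)
```

Keeping the IDs as integers makes the result sort numerically, whereas the list above appears to be sorted as strings (104 comes before 15).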

Problems solved by 1 model only

example_link	model	min_pass1_of_model
106	qwen2.5-coder-14b-instruct	0.386
279	qwen2.5-coder-14b-instruct	0.386
973	qwen2.5-coder-14b-instruct	0.386
743	qwen2.5-coder-14b-instruct	0.386
407	qwen2.5-coder-14b-instruct	0.386
241	qwen2.5-coder-14b-instruct	0.386
131	qwen3-14b	0.370
621	qwen3-14b	0.370
784	qwen3-14b	0.370
55	qwen3-14b	0.370
32	qwen3-14b	0.370
402	qwen3-14b	0.370
105	google_gemma_3_12b_it	0.322
187	google_gemma_3_12b_it	0.322
130	google_gemma_3_12b_it	0.322
446	google_gemma_3_12b_it	0.322
67	google_gemma_3_12b_it	0.322
159	qwen3-8b	0.300
776	qwen3-8b	0.300
662	qwen3-8b	0.300
113	qwen3-4b	0.285
515	qwen3-4b	0.285
258	qwen3-4b	0.285
87	qwen3-4b	0.285
173	qwen2.5-coder-7b-instruct	0.285
272	qwen1.5-14b-chat	0.188
80	google_codegemma_1.1_7b_it	0.186
263	google_codegemma_1.1_7b_it	0.186
772	google_codegemma_1.1_7b_it	0.186
94	google_codegemma_1.1_7b_it	0.186
264	mistralai_ministral_8b_instruct_2410	0.177
787	mistralai_ministral_8b_instruct_2410	0.177
345	mistralai_mathstral_7b_v0.1	0.168
883	mistralai_mistral_7b_instruct_v0.3	0.165
8	mistralai_mistral_7b_instruct_v0.3	0.165
812	qwen3-1.7b	0.146
603	deepseek_r1_distill_qwen_14b	0.140
582	google_gemma_2_9b_it	0.105
299	qwen2-1.5b-instruct	0.029
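A table of this shape can be produced by counting solvers per problem and keeping the problems with exactly one. This is a minimal sketch, assuming that `min_pass1_of_model` is the overall pass@1 of the weakest model that solved the problem (with a single solver, simply that model's score); the source does not define the column, and the data layout and toy values below are hypothetical.

```python
def solved_by_exactly_one(results, overall_pass1):
    """Return (problem_id, model, model_pass1) rows for single-solver problems.

    `results` is a hypothetical layout: {model_name: {problem_id: bool}}.
    `overall_pass1` maps each model name to its overall pass@1.
    Rows are ordered by the solver's overall pass@1, highest first.
    """
    solvers = {}
    for model, per_problem in results.items():
        for problem, solved in per_problem.items():
            if solved:
                solvers.setdefault(problem, []).append(model)
    rows = [
        (problem, models[0], overall_pass1[models[0]])
        for problem, models in solvers.items()
        if len(models) == 1
    ]
    rows.sort(key=lambda row: -row[2])
    return rows


# Toy data (hypothetical): problem 1 has two solvers, 3 and 4 have one each.
toy_results = {
    "model_a": {1: True, 2: False, 3: True},
    "model_b": {1: True, 2: False, 3: False},
    "model_c": {1: False, 2: False, 4: True},
}
toy_overall = {"model_a": 0.4, "model_b": 0.3, "model_c": 0.1}
single_solver_rows = solved_by_exactly_one(toy_results, toy_overall)
for problem, model, score in single_solver_rows:
    print(f"{problem}\t{model}\t{score:.3f}")
```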

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e., better models tend to do no better, or even worse, on these.

example_link	pass1_of_ex	tau
299	0.006	-0.143
533	0.023	-0.086
582	0.009	-0.024
635	0.022	0.004
646	0.050	0.036
240	0.019	0.038
603	0.006	0.048
649	0.231	0.049
165	0.013	0.051
812	0.006	0.060
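The source does not say how tau is computed. One plausible reading is a rank correlation, per problem, between each model's overall score and its result on that problem. The sketch below assumes Kendall's tau-b (which tolerates the many ties that pass/fail outcomes produce); the function names, data layout, and toy values are hypothetical.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences (O(n^2), handles ties)."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: contributes to neither term
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else 0.0


def per_problem_tau(overall, outcomes):
    """Return (problem_id, tau) rows sorted from lowest to highest tau.

    `overall` maps model -> overall score; `outcomes` maps
    model -> {problem_id: pass rate}. Both layouts are assumptions.
    """
    models = sorted(overall)
    xs = [overall[m] for m in models]
    problems = sorted({p for m in models for p in outcomes[m]})
    rows = []
    for p in problems:
        ys = [outcomes[m].get(p, 0.0) for m in models]
        rows.append((p, kendall_tau_b(xs, ys)))
    rows.sort(key=lambda row: row[1])
    return rows


# Toy data (hypothetical): problem 1 is solved only by the weaker models.
toy_overall = {"weak": 0.1, "mid": 0.2, "strong": 0.3, "best": 0.4}
toy_outcomes = {
    "weak": {1: 1.0, 2: 0.0},
    "mid": {1: 1.0, 2: 0.0},
    "strong": {1: 0.0, 2: 1.0},
    "best": {1: 0.0, 2: 1.0},
}
tau_rows = per_problem_tau(toy_overall, toy_outcomes)
for problem, tau in tau_rows:
    print(f"{problem}\t{tau:.3f}")
```

In the toy data, problem 1 gets a negative tau because only the lower-scoring models solve it, which is the pattern that would flag a problem as suspect.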

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
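As a rough illustration of how such a histogram could be built, the sketch below bins per-problem accuracies into equal-width buckets and prints text bars. The bin count and function name are assumptions, and the four sample values are the `pass1_of_ex` entries for problems 299, 533, 582, and 649 from the table above rather than the full dataset.

```python
def accuracy_histogram(pass1_by_problem, n_bins=10):
    """Count problems per equal-width accuracy bin over [0, 1]."""
    counts = [0] * n_bins
    for acc in pass1_by_problem.values():
        # Clamp so that an accuracy of exactly 1.0 lands in the last bin.
        idx = min(int(acc * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# Sample values copied from the suspect-problems table (not the full dataset).
sample_pass1 = {299: 0.006, 533: 0.023, 582: 0.009, 649: 0.231}
hist_counts = accuracy_histogram(sample_pass1)

for i, count in enumerate(hist_counts):
    lo = i / len(hist_counts)
    hi = (i + 1) / len(hist_counts)
    print(f"{lo:.1f}-{hi:.1f} | {'#' * count} {count}")
```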

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.