ds1000: by examples


Not solved by any model

There are 274 examples that no model solved. If these problems are sound, solving some of them is a good signal that your model is genuinely better than the leading models.
104, 107, 108, 121, 122, 131, 142, 15, 157, 172, 173, 174, 181, 182, 183, 184, 195, 196, 197, 199, 200, 201, 202, 203, 204, 208, 209, 210, 211, 220, 221, 222, 225, 242, 244, 245, 269, 270, 272, 281, 282, 283, 284, 285, 286, 29, 328, 339, 362, 371, 372, 375, 379, 380, 385, 389, 39, 390, 394, 40, 408, 410, 42, 420, 421, 43, 44, 445, 45, 455, 46, 47, 475, 48, 486, 487, 488, 49, 509, 516, 520, 521, 585, 59, 6, 60, 612, 65, 666, 667, 668, 669, 67, 670, 671, 672, 673, 674, 675, 676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 726, 727, 73, 74, 745, 747, 75, 763, 764, 766, 775, 776, 779, 780, 789, 790, 794, 795, 796, 798, 808, 81, 810, 817, 818, 819, 820, 821, 822, 823, 824, 825, 826, 827, 831, 832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857, 858, 859, 86, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870, 871, 872, 873, 874, 875, 876, 877, 878, 879, 88, 880, 881, 882, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896, 897, 898, 899, 9, 900, 901, 902, 903, 904, 905, 906, 907, 908, 909, 910, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920, 921, 922, 923, 924, 925, 926, 927, 928, 929, 930, 931, 936, 937, 96, 987, 993
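
As a minimal sketch of how such a list could be derived, assume a hypothetical mapping from model name to per-problem pass/fail outcomes. The names `results` and `unsolved_by_all` and the toy data are illustrative and not taken from this report. A problem counts as unsolved when no model has a passing outcome for it.

```python
# Hypothetical per-model results: model name -> {problem id -> solved?}.
# Model names and outcomes are made up for illustration.
results = {
    "model_a": {1: True, 2: False, 3: False},
    "model_b": {1: True, 2: True, 3: False},
}


def unsolved_by_all(results):
    """Return the sorted ids of problems that no model solved."""
    problem_ids = set()
    for per_problem in results.values():
        problem_ids.update(per_problem)
    return sorted(
        pid
        for pid in problem_ids
        # A missing entry is treated as "not solved".
        if not any(per_problem.get(pid, False) for per_problem in results.values())
    )


unsolved = unsolved_by_all(results)
print(unsolved)
```

The sketch returns ids in numeric order, whereas the list above appears to be ordered as strings (for example, 15 follows 142).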

Problems solved by 1 model only

example_link	model	min_pass1_of_model
781	qwen2.5-coder-14b-instruct	0.432
66	qwen2.5-coder-14b-instruct	0.432
750	qwen2.5-coder-14b-instruct	0.432
338	qwen2.5-coder-14b-instruct	0.432
751	qwen2.5-coder-14b-instruct	0.432
58	qwen2.5-coder-14b-instruct	0.432
996	qwen3-14b	0.380
134	qwen3-14b	0.380
54	qwen3-14b	0.380
7	qwen3-14b	0.380
279	qwen3-14b	0.380
216	qwen3-14b	0.380
280	qwen3-14b	0.380
388	qwen3-14b	0.380
809	qwen3-14b	0.380
164	qwen3-14b	0.380
113	qwen2.5-coder-7b-instruct	0.357
112	google_gemma_3_12b_it	0.328
755	google_gemma_3_12b_it	0.328
446	google_gemma_3_12b_it	0.328
158	qwen3-8b	0.312
812	qwen3-8b	0.312
407	mistralai_ministral_8b_instruct_2410	0.247
161	qwen2.5-coder-3b-instruct	0.238
228	qwen2.5-coder-3b-instruct	0.238
227	qwen2.5-coder-3b-instruct	0.238
263	qwen2-7b-instruct	0.236
345	qwen1.5-14b-chat	0.205
635	google_gemma_3_4b_it	0.195
523	mistralai_mistral_7b_instruct_v0.3	0.182
800	deepseek_v2_lite_chat	0.174
468	deepseek_r1_distill_qwen_14b	0.161
953	deepseek_r1_distill_qwen_7b	0.148
596	qwen2.5-coder-1.5b-instruct	0.141
159	mistralai_mistral_7b_instruct_v0.2	0.136
447	google_gemma_7b_it	0.048
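
A table like the one above could be built roughly as follows. This is a sketch under two assumptions: that `min_pass1_of_model` is the overall pass@1 of the single model that solved the problem, and that rows are ordered by that value, highest first. The inputs `solved_by` and `overall_pass1` and their contents are hypothetical.

```python
# Hypothetical inputs; names and numbers are illustrative, not from the report.
solved_by = {
    10: ["model_a"],
    11: ["model_a", "model_b"],
    12: ["model_b"],
    13: [],
}
overall_pass1 = {"model_a": 0.40, "model_b": 0.25}


def solved_by_one_model(solved_by, overall_pass1):
    """Return (problem id, model, model's overall pass@1) for single-solver problems."""
    rows = [
        (pid, models[0], overall_pass1[models[0]])
        for pid, models in solved_by.items()
        if len(models) == 1
    ]
    # Strongest solving model first (assumed ordering of the table).
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


rows = solved_by_one_model(solved_by, overall_pass1)
for pid, model, score in rows:
    print(f"{pid}\t{model}\t{score:.3f}")
```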

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation. A negative tau means that better models tend to do worse on the problem.

example_link	pass1_of_ex	tau
604	0.004	-0.171
802	0.011	-0.157
447	0.002	-0.119
603	0.038	-0.038
159	0.003	-0.024
649	0.303	-0.023
533	0.029	-0.016
596	0.002	-0.012
165	0.027	-0.010
953	0.002	0.000
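
The report does not say how tau is computed. One plausible reading is a Kendall rank correlation between each model's overall score and its result on the problem. The sketch below assumes the tau-b variant, which corrects for ties, and uses made-up scores for four hypothetical models. The function name `kendall_tau_b` is illustrative.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b by direct pair counting (O(n^2), fine for a few models)."""
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    denom = sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    # With no untied pairs the correlation is undefined; return 0.0 here.
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical data: overall score per model, and pass rate on one problem.
overall_scores = [0.1, 0.2, 0.3, 0.4]
problem_pass_rates = [1.0, 1.0, 0.0, 0.0]  # only the weaker models solve it

tau = kendall_tau_b(overall_scores, problem_pass_rates)
print(round(tau, 3))
```

In this toy case only the weaker models solve the problem, so tau is negative, which is the pattern that would flag a problem as suspect.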

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
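
As a rough illustration of how per-problem accuracies can be binned for such a histogram, the sketch below counts values into equal-width bins on [0, 1]. The sample is only the ten `pass1_of_ex` values from the suspect-problems table, not the full benchmark, and the bin count of 10 is an arbitrary choice.

```python
def histogram(values, n_bins=10):
    """Count values in [0, 1] into n_bins equal-width bins."""
    counts = [0] * n_bins
    for v in values:
        # Clamp so that a value of exactly 1.0 lands in the last bin.
        idx = min(int(v * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# pass1_of_ex values copied from the suspect-problems table (a small sample).
sample_accuracies = [0.004, 0.011, 0.002, 0.038, 0.003,
                     0.303, 0.029, 0.002, 0.027, 0.002]

counts = histogram(sample_accuracies)
for i, count in enumerate(counts):
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f}) {'#' * count}")
```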

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.