There are 156 examples that no model solves.
If these are good problems, solving some of them is a strong signal that your model genuinely improves on the leading models.
164, 291, 380, 442, 499, 501, 682, 690, 696, 699, 786, 843, 958, 980, 1131, 1209, 1388, 1416, 1444, 1465, 1585, 1611, 1641, 1658, 1675, 1932, 2044, 2121, 2122, 2344, 2504, 2507, 2515, 2564, 2692, 2729, 2732, 2769, 2778, 2866, 2889, 2949, 3043, 3067, 3112, 3138, 3148, 3179, 3275, 3276, 3281, 3377, 3386, 3409, 3507, 3576, 3582, 3610, 3637, 3638, 3818, 3953, 3954, 3961, 3973, 4033, 4211, 4222, 4275, 4348, 4401, 4432, 4434, 4443, 4548, 4587, 4720, 4722, 4725, 4884, 4888, 4890, 4892, 4923, 4963, 5120, 5132, 5198, 5232, 5258, 5281, 5338, 5403, 5657, 5788, 5796, 6035, 6036, 6241, 6388, 6401, 6467, 6499, 6537, 6590, 6640, 6779, 6809, 6844, 6865, 6897, 7012, 7120, 7148, 7232, 7233, 7337, 7356, 7440, 7552, 7564, 7611, 7619, 7705, 7995, 8008, 8009, 8011, 8012, 8027, 8040, 8128, 8131, 8147, 8336, 8337, 8386, 8387, 8389, 8531, 8562, 8706, 8708, 8843, 9096, 9276, 9311, 9329, 9404, 9522, 9627, 9635, 9873, 10155, 10380, 10497
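As a rough sketch, the list above can be reproduced from a per-example, per-model pass@1 matrix. The file name and layout below are assumptions, not the actual pipeline:

```python
import pandas as pd

# Assumed layout (hypothetical): rows = examples, columns = models,
# entries = that model's pass@1 on that example.
pass1_matrix = pd.read_csv("pass1_per_example.csv", index_col="example_link")

# Treat an example as solved by a model if that model's pass@1 on it is
# positive; it is unsolved if no model ever solves it.
solved = pass1_matrix > 0
unsolved = sorted(solved.index[~solved.any(axis=1)])
print(len(unsolved))  # expected: 156
print(unsolved)
```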
The table below covers solved examples: for each one, it lists the model with the lowest overall pass@1 that still solved it (`min_pass1_of_model`). A high value means only strong models solve the example.

| example_link | model | min_pass1_of_model |
|---|---|---|
| 2371 | qwen3-32b | 0.814 |
| 5405 | qwen3-14b | 0.811 |
| 6312 | llama-3.1-70B-instruct | 0.773 |
| 8014 | llama-3.1-70B-instruct | 0.773 |
| 4497 | llama-3.1-70B-instruct | 0.773 |
| 2138 | google_gemma_3_27b_it | 0.751 |
| 3958 | google_gemma_3_27b_it | 0.751 |
| 2950 | qwen2-72b-instruct | 0.744 |
| 467 | qwen2-72b-instruct | 0.744 |
| 9868 | qwen2-72b-instruct | 0.744 |
| 3427 | qwen2-72b-instruct | 0.744 |
| 7482 | qwen3-4b | 0.737 |
| 3243 | deepseek_r1_distill_qwen_14b | 0.710 |
| 5512 | deepseek_r1_distill_qwen_14b | 0.710 |
| 450 | google_gemma_2_9b_it | 0.701 |
| 9539 | deepseek_r1_distill_qwen_7b | 0.686 |
| 1972 | deepseek_r1_distill_qwen_7b | 0.686 |
| 4723 | qwen2.5-coder-14b-instruct | 0.668 |
| 6262 | qwen2.5-coder-14b-instruct | 0.668 |
| 956 | qwen2-math-72b-instruct | 0.663 |
| 2139 | qwen2-math-72b-instruct | 0.663 |
| 4726 | qwen2-math-72b-instruct | 0.663 |
| 1816 | qwen2-math-72b-instruct | 0.663 |
| 1819 | qwen2-math-72b-instruct | 0.663 |
| 1299 | qwen1.5-72b-chat | 0.657 |
| 8705 | qwen1.5-72b-chat | 0.657 |
| 4258 | qwen1.5-72b-chat | 0.657 |
| 7627 | qwen1.5-32b-chat | 0.648 |
| 4148 | qwen1.5-32b-chat | 0.648 |
| 5067 | google_gemma_3_4b_it | 0.641 |
| 2748 | mistralai_mixtral_8x22b_instruct_v0.1 | 0.632 |
| 9075 | mistralai_mixtral_8x22b_instruct_v0.1 | 0.632 |
| 5052 | mistralai_mixtral_8x22b_instruct_v0.1 | 0.632 |
| 1258 | mistralai_mixtral_8x22b_instruct_v0.1 | 0.632 |
| 9523 | qwen2-math-7b-instruct | 0.620 |
| 5451 | qwen2-math-7b-instruct | 0.620 |
| 9363 | qwen2-math-7b-instruct | 0.620 |
| 2953 | qwen2-math-7b-instruct | 0.620 |
| 954 | llama-3.1-8B-instruct | 0.606 |
| 5675 | mistralai_ministral_8b_instruct_2410 | 0.597 |
| 1532 | qwen2-7b-instruct | 0.592 |
| 8180 | qwen2-7b-instruct | 0.592 |
| 2946 | qwen2-7b-instruct | 0.592 |
| 8211 | qwen2-7b-instruct | 0.592 |
| 9043 | qwen3-1.7b | 0.586 |
| 9068 | qwen2-math-1.5b-instruct | 0.568 |
| 4042 | qwen2-math-1.5b-instruct | 0.568 |
| 104 | qwen1.5-14b-chat | 0.536 |
| 1393 | qwen1.5-14b-chat | 0.536 |
| 6459 | qwen1.5-14b-chat | 0.536 |
| 5322 | qwen1.5-14b-chat | 0.536 |
| 5660 | qwen1.5-14b-chat | 0.536 |
| 9932 | qwen2.5-coder-7b-instruct | 0.519 |
| 6898 | qwen2.5-coder-7b-instruct | 0.519 |
| 5195 | deepseek_r1_distill_qwen_1.5b | 0.496 |
| 5376 | mistralai_mathstral_7b_v0.1 | 0.475 |
| 5771 | mistralai_mathstral_7b_v0.1 | 0.475 |
| 5221 | mistralai_mathstral_7b_v0.1 | 0.475 |
| 1364 | mistralai_mathstral_7b_v0.1 | 0.475 |
| 1507 | mistralai_mathstral_7b_v0.1 | 0.475 |
| 7332 | qwen1.5-7b-chat | 0.402 |
| 4164 | qwen1.5-7b-chat | 0.402 |
| 4531 | google_codegemma_1.1_7b_it | 0.335 |
| 9810 | google_gemma_3_1b_it | 0.289 |
| 3230 | google_gemma_3_1b_it | 0.289 |
| 7236 | qwen3-0.6b | 0.280 |
| 4436 | qwen3-0.6b | 0.280 |
| 4617 | qwen3-0.6b | 0.280 |
| 1056 | mistralai_mistral_7b_instruct_v0.2 | 0.279 |
| 9608 | llama-3.2-1B-instruct | 0.190 |
| 5812 | llama-3.2-1B-instruct | 0.190 |
| 7347 | google_gemma_7b_it | 0.185 |
| 4311 | mistralai_mistral_7b_instruct_v0.1 | 0.160 |
| 1191 | qwen2-1.5b-instruct | 0.152 |
| 8394 | qwen2-1.5b-instruct | 0.152 |
| 6447 | qwen1.5-1.8b-chat | 0.116 |
| 5297 | qwen2.5-coder-0.5b-instruct | 0.071 |
| 1168 | qwen2.5-coder-0.5b-instruct | 0.071 |
| 3193 | qwen2.5-coder-0.5b-instruct | 0.071 |
| 3464 | qwen2.5-coder-0.5b-instruct | 0.071 |
| 936 | qwen2.5-coder-0.5b-instruct | 0.071 |
| 2891 | qwen2-0.5b-instruct | 0.066 |
| 4879 | qwen2-0.5b-instruct | 0.066 |
| 6905 | google_gemma_2b_it | 0.062 |
| 2867 | google_gemma_2b_it | 0.062 |
| 3904 | google_gemma_2b_it | 0.062 |
| 411 | google_gemma_2b_it | 0.062 |
| 7299 | google_gemma_2b_it | 0.062 |
| 9719 | qwen1.5-0.5b-chat | 0.054 |
| 8092 | qwen1.5-0.5b-chat | 0.054 |
| 3220 | qwen1.5-0.5b-chat | 0.054 |
| 2890 | qwen1.5-0.5b-chat | 0.054 |
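One plausible way to derive such a table, continuing with the assumed `pass1_matrix` from the sketch above (the actual selection and filtering may differ):

```python
# Overall pass@1 per model, used to rank models by strength.
model_pass1 = pass1_matrix.mean(axis=0)

rows = []
for ex, solved_row in (pass1_matrix > 0).iterrows():
    solvers = solved_row.index[solved_row]   # models that solve this example
    if len(solvers) == 0:
        continue                             # unsolved examples are listed above
    weakest = model_pass1[solvers].idxmin()  # weakest model that still solves it
    rows.append({"example_link": ex, "model": weakest,
                 "min_pass1_of_model": model_pass1[weakest]})

weakest_table = (pd.DataFrame(rows)
                 .sort_values("min_pass1_of_model", ascending=False))
```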
These are the 10 problems with the lowest rank correlation with the overall evaluation, i.e., better models tend to do worse on them. `pass1_of_ex` is the overall pass@1 on the example; `tau` is the rank correlation between per-model results on the example and overall model strength.

| example_link | pass1_of_ex | tau |
|---|---|---|
| 8015 | 0.083 | -0.459 |
| 2935 | 0.046 | -0.431 |
| 10047 | 0.038 | -0.413 |
| 9991 | 0.074 | -0.411 |
| 717 | 0.199 | -0.404 |
| 7535 | 0.080 | -0.393 |
| 9252 | 0.072 | -0.389 |
| 2039 | 0.045 | -0.387 |
| 7068 | 0.252 | -0.380 |
| 5663 | 0.026 | -0.378 |
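A minimal sketch of this screen, reusing the assumed `pass1_matrix` and interpreting `tau` as Kendall's tau (an assumption based on the column name):

```python
from scipy.stats import kendalltau

model_pass1 = pass1_matrix.mean(axis=0)  # overall pass@1 per model

def tau_vs_overall(example_row: pd.Series) -> float:
    # Kendall's tau between per-model results on one example and the
    # models' overall scores; negative tau means better models tend
    # to do worse on this example.
    tau, _ = kendalltau(example_row.values, model_pass1.values)
    return tau

taus = pass1_matrix.apply(tau_vs_overall, axis=1)
report = pd.DataFrame({"pass1_of_ex": pass1_matrix.mean(axis=1), "tau": taus})
print(report.nsmallest(10, "tau").round(3))  # 10 most anti-correlated problems
```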
Figure: histogram of problems by per-problem accuracy.

Figure: histogram of problems by the lowest win rate among the models that solve each problem.
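A sketch of how the two histograms could be produced, again over the assumed `pass1_matrix`; equating a model's win rate with its overall pass@1 here is an assumption:

```python
import matplotlib.pyplot as plt

model_pass1 = pass1_matrix.mean(axis=0)    # proxy for per-model win rate
example_pass1 = pass1_matrix.mean(axis=1)  # per-problem accuracy
solved = pass1_matrix > 0

# For each problem, the lowest win rate among models that solve it;
# NaN for the 156 problems no model solves.
min_winrate = solved.apply(
    lambda row: model_pass1[row.index[row]].min() if row.any() else float("nan"),
    axis=1,
)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(example_pass1, bins=20)
ax1.set(xlabel="per-problem accuracy", ylabel="number of problems")
ax2.hist(min_winrate.dropna(), bins=20)
ax2.set(xlabel="minimum win rate among solving models", ylabel="number of problems")
fig.tight_layout()
fig.savefig("problem_histograms.png")
```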