There are 52 examples not solved by any model.
If these are good problems, solving some of them is a strong signal that your model genuinely outperforms the leading models.
56, 999, 1017, 1402, 1468, 2677, 2810, 2909, 3265, 3282, 3528, 3570, 4146, 4202, 4499, 4651, 4689, 4736, 4966, 5072, 5348, 5529, 5724, 5857, 5895, 5997, 6063, 6081, 6117, 6177, 6232, 6343, 6355, 6600, 6769, 6859, 6949, 7006, 7145, 8249, 8754, 8851, 8880, 9324, 9571, 9649, 9902, 9917, 11277, 11635, 11741, 11801
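Finding these examples is a set operation over per-model results. A minimal sketch, assuming results are available as a set of solved example IDs per model (the model names and IDs below are illustrative, not the actual data):

```python
# Hypothetical data shape: solved[model] = set of example IDs that the
# model solved at least once. All names and IDs here are illustrative.
solved = {
    "model_a": {56, 999, 1017},
    "model_b": {999, 1402},
    "model_c": {1017, 1402},
}

all_examples = {56, 999, 1017, 1402, 2677}

# An example is "unsolved" if no model solved it.
solved_by_any = set().union(*solved.values())
unsolved = sorted(all_examples - solved_by_any)
print(unsolved)  # [2677]
```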
| example_link | model | min_pass1_of_model |
|---|---|---|
| 11402 | qwen3-32b | 0.690 |
| 2848 | qwen3-14b | 0.672 |
| 4945 | qwen3-14b | 0.672 |
| 128 | qwen3-8b | 0.625 |
| 6049 | llama-3.1-70B-instruct | 0.603 |
| 9565 | llama-3.1-70B-instruct | 0.603 |
| 5833 | llama-3.1-70B-instruct | 0.603 |
| 5956 | llama-3.1-70B-instruct | 0.603 |
| 7386 | llama-3.1-70B-instruct | 0.603 |
| 9320 | llama-3.1-70B-instruct | 0.603 |
| 11749 | qwen2-72b-instruct | 0.601 |
| 6341 | qwen2-72b-instruct | 0.601 |
| 686 | qwen2-72b-instruct | 0.601 |
| 5426 | qwen2-72b-instruct | 0.601 |
| 5537 | qwen2.5-coder-32b-instruct | 0.593 |
| 1342 | google_gemma_3_12b_it | 0.584 |
| 10264 | google_gemma_3_12b_it | 0.584 |
| 8567 | deepseek_r1_distill_qwen_14b | 0.447 |
| 5029 | qwen2-math-72b-instruct | 0.444 |
| 4835 | qwen1.5-32b-chat | 0.433 |
| 9142 | qwen3-1.7b | 0.424 |
| 5054 | qwen3-1.7b | 0.424 |
| 11313 | qwen3-1.7b | 0.424 |
| 11533 | qwen3-1.7b | 0.424 |
| 6109 | qwen3-1.7b | 0.424 |
| 8062 | google_gemma_3_4b_it | 0.415 |
| 10613 | qwen2-7b-instruct | 0.410 |
| 4654 | qwen1.5-14b-chat | 0.360 |
| 8042 | mistralai_ministral_8b_instruct_2410 | 0.328 |
| 8212 | mistralai_ministral_8b_instruct_2410 | 0.328 |
| 5694 | mistralai_ministral_8b_instruct_2410 | 0.328 |
| 221 | qwen2.5-coder-7b-instruct | 0.321 |
| 11637 | qwen2.5-coder-7b-instruct | 0.321 |
| 7121 | qwen2.5-coder-7b-instruct | 0.321 |
| 1820 | mistralai_mistral_7b_instruct_v0.3 | 0.313 |
| 10914 | llama-3.2-3B-instruct | 0.292 |
| 6906 | llama-3.2-3B-instruct | 0.292 |
| 6329 | qwen2-math-7b-instruct | 0.290 |
| 3954 | qwen2-math-7b-instruct | 0.290 |
| 2223 | qwen2-math-7b-instruct | 0.290 |
| 6663 | qwen2-math-7b-instruct | 0.290 |
| 10879 | qwen2-math-7b-instruct | 0.290 |
| 4967 | qwen2-math-7b-instruct | 0.290 |
| 5367 | mistralai_mistral_7b_instruct_v0.2 | 0.279 |
| 2682 | mistralai_mistral_7b_instruct_v0.2 | 0.279 |
| 6948 | deepseek_v2_lite_chat | 0.261 |
| 4382 | qwen2.5-coder-3b-instruct | 0.259 |
| 5797 | qwen1.5-7b-chat | 0.231 |
| 9551 | qwen3-0.6b | 0.230 |
| 8432 | qwen3-0.6b | 0.230 |
| 9820 | qwen3-0.6b | 0.230 |
| 8968 | qwen3-0.6b | 0.230 |
| 11484 | qwen3-0.6b | 0.230 |
| 9208 | mistralai_mistral_7b_instruct_v0.1 | 0.190 |
| 11205 | mistralai_mistral_7b_instruct_v0.1 | 0.190 |
| 4183 | mistralai_mistral_7b_instruct_v0.1 | 0.190 |
| 6100 | mistralai_mistral_7b_instruct_v0.1 | 0.190 |
| 9502 | mistralai_mistral_7b_instruct_v0.1 | 0.190 |
| 5568 | qwen2.5-coder-1.5b-instruct | 0.169 |
| 9737 | qwen2.5-coder-1.5b-instruct | 0.169 |
| 11688 | qwen2.5-coder-1.5b-instruct | 0.169 |
| 11981 | qwen2.5-coder-1.5b-instruct | 0.169 |
| 8672 | qwen2-math-1.5b-instruct | 0.166 |
| 9880 | qwen2-math-1.5b-instruct | 0.166 |
| 5809 | qwen2-math-1.5b-instruct | 0.166 |
| 6095 | qwen2-math-1.5b-instruct | 0.166 |
| 3519 | qwen2-math-1.5b-instruct | 0.166 |
| 6192 | qwen2-math-1.5b-instruct | 0.166 |
| 9661 | qwen2-math-1.5b-instruct | 0.166 |
| 11451 | qwen2-math-1.5b-instruct | 0.166 |
| 5994 | qwen2-math-1.5b-instruct | 0.166 |
| 11274 | llama-3.2-1B-instruct | 0.165 |
| 4485 | llama-3.2-1B-instruct | 0.165 |
| 9782 | llama-3.2-1B-instruct | 0.165 |
| 5983 | deepseek_r1_distill_qwen_1.5b | 0.159 |
| 5639 | deepseek_r1_distill_qwen_1.5b | 0.159 |
| 4950 | qwen2-1.5b-instruct | 0.116 |
| 11394 | qwen2-1.5b-instruct | 0.116 |
| 11422 | qwen2-1.5b-instruct | 0.116 |
| 6019 | qwen2-0.5b-instruct | 0.095 |
| 5031 | qwen2-0.5b-instruct | 0.095 |
| 5185 | qwen2-0.5b-instruct | 0.095 |
| 11440 | qwen2-0.5b-instruct | 0.095 |
| 1329 | qwen2-0.5b-instruct | 0.095 |
| 8138 | qwen2-0.5b-instruct | 0.095 |
| 1581 | qwen2-0.5b-instruct | 0.095 |
| 6818 | qwen2-0.5b-instruct | 0.095 |
| 9853 | qwen2-0.5b-instruct | 0.095 |
| 5944 | qwen2.5-coder-0.5b-instruct | 0.093 |
| 1325 | qwen2.5-coder-0.5b-instruct | 0.093 |
| 1303 | qwen2.5-coder-0.5b-instruct | 0.093 |
| 9739 | qwen2.5-coder-0.5b-instruct | 0.093 |
| 7036 | qwen2.5-coder-0.5b-instruct | 0.093 |
| 2700 | qwen2.5-coder-0.5b-instruct | 0.093 |
| 6363 | qwen2.5-coder-0.5b-instruct | 0.093 |
| 6805 | qwen1.5-1.8b-chat | 0.088 |
| 616 | qwen1.5-1.8b-chat | 0.088 |
| 13 | qwen1.5-1.8b-chat | 0.088 |
| 6742 | qwen1.5-1.8b-chat | 0.088 |
| 9832 | qwen1.5-0.5b-chat | 0.055 |
| 5528 | qwen1.5-0.5b-chat | 0.055 |
| 5865 | qwen1.5-0.5b-chat | 0.055 |
| 5144 | qwen1.5-0.5b-chat | 0.055 |
| 5349 | qwen1.5-0.5b-chat | 0.055 |
| 7241 | qwen1.5-0.5b-chat | 0.055 |
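The table above can be reproduced with a simple pass over the results: for each example, pick the solving model with the lowest overall pass@1. This is a sketch under an assumed data shape (a pass@1 score per model and a set of solved examples per model); all names and values are illustrative:

```python
# Hypothetical shape: overall pass@1 per model, and which examples each
# model solved. All names and values are illustrative.
model_pass1 = {"strong": 0.69, "mid": 0.42, "weak": 0.09}
solved = {
    "strong": {11402, 2848},
    "mid": {2848, 128},
    "weak": {128},
}

# For each example, find the weakest model (lowest overall pass@1) that
# solved it -- the min_pass1_of_model column above.
weakest_solver = {}
for model, examples in solved.items():
    for ex in examples:
        cur = weakest_solver.get(ex)
        if cur is None or model_pass1[model] < model_pass1[cur]:
            weakest_solver[ex] = model

for ex in sorted(weakest_solver):
    m = weakest_solver[ex]
    print(ex, m, model_pass1[m])
```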
These are the 10 problems whose results correlate least with the overall evaluation (i.e., better models tend to do worse on these).
| example_link | pass1_of_ex | tau |
|---|---|---|
| 5495 | 0.111 | -0.494 |
| 3512 | 0.039 | -0.422 |
| 9882 | 0.117 | -0.422 |
| 5056 | 0.072 | -0.417 |
| 956 | 0.024 | -0.406 |
| 4691 | 0.094 | -0.403 |
| 350 | 0.070 | -0.401 |
| 425 | 0.091 | -0.394 |
| 4192 | 0.124 | -0.388 |
| 8163 | 0.026 | -0.387 |
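The `tau` column can be computed as a rank correlation between overall model strength and per-problem performance. A minimal sketch using a hand-rolled Kendall tau (tau-a, no tie correction; the input vectors below are illustrative, not the actual data):

```python
# Kendall tau-a between two paired score vectors: counts concordant vs.
# discordant pairs. Illustrative only; a production analysis might use
# scipy.stats.kendalltau, which also handles ties.
def kendall_tau(xs, ys):
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Overall model pass@1 vs. accuracy on one "anti-correlated" problem:
overall = [0.69, 0.60, 0.42, 0.23, 0.09]  # stronger models first
on_problem = [0.0, 0.1, 0.2, 0.3, 0.4]    # weaker models do better here
print(kendall_tau(overall, on_problem))   # -1.0
```

A strongly negative tau, as in the table above, means the ordering of models on that one problem tends to invert the overall model ranking.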
Histogram of problems binned by per-problem accuracy.
Histogram of problems binned by the minimum win rate needed to solve each problem.
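Binning problems for the first histogram is straightforward; a sketch with ten equal-width accuracy bins (the accuracy values are illustrative):

```python
# Bucket problems into ten equal-width accuracy bins [0.0, 0.1), ...,
# [0.9, 1.0]. The accuracy values below are illustrative only.
accuracies = [0.0, 0.05, 0.12, 0.33, 0.5, 0.88, 0.97, 1.0]

bins = [0] * 10
for a in accuracies:
    idx = min(int(a * 10), 9)  # clamp accuracy 1.0 into the last bin
    bins[idx] += 1

for i, count in enumerate(bins):
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f}): {count}")
```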