There are 8 examples not solved by any model.
If these are well-posed problems, solving some of them is a good signal that your model is indeed better than the leading models.
1, 6, 10, 12, 14, 17, 25, 27
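A list like the one above could be derived from per-model results roughly as follows. This is a minimal sketch, not the report's actual pipeline: the function name `unsolved_problems`, the nested-dict layout, and the toy data are all assumptions made for illustration, and "not solved" is taken to mean a pass@1 of exactly 0 for every model.

```python
from typing import Dict, List


def unsolved_problems(pass1: Dict[str, Dict[int, float]]) -> List[int]:
    """Return the ids of problems whose pass@1 is 0 for every model.

    `pass1` maps model name -> {problem id -> pass@1}. A problem missing
    from a model's results is treated as unsolved by that model
    (an assumption of this sketch).
    """
    problem_ids = set()
    for per_problem in pass1.values():
        problem_ids.update(per_problem)
    # Sort numerically so that 6 comes before 10.
    return sorted(
        pid
        for pid in problem_ids
        if all(scores.get(pid, 0.0) == 0.0 for scores in pass1.values())
    )


# Hypothetical toy data, not taken from the report.
toy = {
    "model_a": {1: 0.0, 2: 0.4, 3: 0.0},
    "model_b": {1: 0.0, 2: 0.0, 3: 0.1},
}
print(unsolved_problems(toy))
```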
| example_link | model | min_pass1_of_model |
|---|---|---|
| 28 | deepseek_r1_distill_llama_70b | 0.258 |
| 11 | qwen3-14b | 0.181 |
| 9 | qwen3-4b | 0.164 |
| 29 | qwen2.5-coder-14b-instruct | 0.078 |
| 22 | qwen2-72b-instruct | 0.036 |
| 13 | qwen1.5-72b-chat | 0.003 |
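These are the 10 problems whose results correlate least with the overall evaluation. A tau near zero means a model's overall strength says little about whether it solves the problem; a negative tau means better models tend to do worse on it.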
| example_link | pass1_of_ex | tau |
|---|---|---|
| 13 | 0.002 | -0.049 |
| 22 | 0.002 | 0.061 |
| 29 | 0.002 | 0.069 |
| 9 | 0.002 | 0.117 |
| 11 | 0.002 | 0.142 |
| 28 | 0.002 | 0.206 |
| 23 | 0.021 | 0.279 |
| 24 | 0.005 | 0.285 |
| 19 | 0.018 | 0.287 |
| 21 | 0.012 | 0.288 |
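For readers who want to reproduce a column like `tau`, the sketch below computes a rank correlation between each model's overall score and its pass@1 on one problem. It assumes `tau` is Kendall's tau and uses the tau-b variant, which corrects for ties; the report does not say which variant was used. The function name `kendall_tau_b` and the toy numbers are illustrative only.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b between two equal-length sequences.

    Uses the plain O(n^2) pair enumeration. Returns 0.0 when either
    sequence is constant, because the statistic is undefined there
    (a choice made for this sketch).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both sequences: contributes to neither term
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif (dx > 0) == (dy > 0):
            concordant += 1
        else:
            discordant += 1
    denom = sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical data: one entry per model, ordered consistently.
overall_score = [0.1, 0.3, 0.5, 0.7]     # each model's overall accuracy
problem_pass1 = [0.0, 0.0, 0.2, 0.6]     # each model's pass@1 on one problem
print(kendall_tau_b(overall_score, problem_pass1))
```

Because most models score 0 on very hard problems, ties are common in this setting, which is why a tie-corrected variant is used here.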
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.