There are 11 examples that no model solved.
If these are well-posed problems, solving some of them is a good signal that your model is indeed better than the leading models.
130, 132, 139, 140, 145, 28, 32, 4, 51, 83, 86
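
As a minimal sketch of how such a list could be derived, the snippet below assumes per-model results are stored as a nested mapping `pass1[model][example_id]` holding the pass@1 rate. That layout, the function name `unsolved_examples`, and the toy numbers are illustrative assumptions, not the report's actual pipeline.

```python
# Hypothetical layout: pass1[model][example_id] = pass@1 rate in [0, 1].
def unsolved_examples(pass1):
    """Return the sorted example ids whose pass@1 is 0 for every model."""
    all_ids = set()
    for scores in pass1.values():
        all_ids.update(scores)
    # An example counts as unsolved only if no model has a non-zero pass@1 on it.
    # A missing entry is treated as 0 (an assumption of this sketch).
    return sorted(
        ex for ex in all_ids
        if all(scores.get(ex, 0.0) == 0.0 for scores in pass1.values())
    )


# Toy data, made up for illustration only.
pass1 = {
    "model_a": {"1": 0.5, "2": 0.0, "3": 0.0},
    "model_b": {"1": 0.0, "2": 0.0, "3": 0.25},
}
print(unsolved_examples(pass1))
```

Ids are kept as strings here, so `sorted` orders them lexicographically, which happens to match the ordering of the list above.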
| example_link | model | min_pass1_of_model |
|---|---|---|
| 163 | qwen3-32b | 0.734 |
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., on these problems better models do not reliably score higher, and may even score lower).
| example_link | pass1_of_ex | tau |
|---|---|---|
| 21 | 0.023 | -0.032 |
| 154 | 0.108 | 0.006 |
| 160 | 0.074 | 0.059 |
| 127 | 0.114 | 0.158 |
| 22 | 0.075 | 0.170 |
| 163 | 0.007 | 0.174 |
| 77 | 0.095 | 0.190 |
| 91 | 0.023 | 0.228 |
| 48 | 0.296 | 0.238 |
| 121 | 0.452 | 0.248 |
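
The report does not say how `tau` is computed. Assuming it is a Kendall rank correlation between each model's overall score and its pass@1 on the given problem, a per-problem value could be sketched as follows. The tau-a variant (no tie correction), the function name, and the toy scores are all assumptions made for illustration.

```python
def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total pairs, ignoring ties."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
            # s == 0 is a tie in xs or ys; tau-a counts it as neither.
    return (concordant - discordant) / (n * (n - 1) // 2)


# Toy data, made up: one entry per model, ordered consistently.
overall_score = [0.2, 0.4, 0.6, 0.8]       # overall benchmark score per model
problem_pass1 = [0.1, 0.3, 0.2, 0.5]       # pass@1 of each model on one problem
tau = kendall_tau_a(overall_score, problem_pass1)
print(round(tau, 3))
```

A value near 1 means stronger models also do better on the problem; a value near 0 or below means the problem does not track overall model quality, which is what the table above surfaces.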
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.