There are 6 examples not solved by any model:
9, 10, 11, 23, 28, 29
Provided these are well-posed problems, solving some of them is a good signal that your model is indeed better than the leading models.
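As a minimal sketch of how such a list could be derived: the data layout below (a hypothetical `pass1` mapping from model name to per-example pass@1) and the toy values are assumptions for illustration, not the report's actual pipeline or results. An example counts as unsolved when every model has a pass@1 of zero on it.

```python
def unsolved_examples(pass1):
    """Return the sorted example ids whose pass@1 is zero for every model.

    `pass1` is a hypothetical layout: {model_name: {example_id: pass@1}}.
    A model with no entry for an example is treated as not having solved it.
    """
    all_examples = set()
    for per_example in pass1.values():
        all_examples.update(per_example)
    return sorted(
        ex
        for ex in all_examples
        if all(per_example.get(ex, 0.0) == 0.0 for per_example in pass1.values())
    )


# Toy data, made up for illustration only.
toy_pass1 = {
    "model_a": {7: 0.25, 9: 0.0, 22: 0.0},
    "model_b": {7: 0.0, 9: 0.0, 22: 0.5},
}
unsolved = unsolved_examples(toy_pass1)
print(unsolved)
```

In the toy data only example 9 has zero pass@1 under both models, so it is the only id returned.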
| example_link | model | min_pass1_of_model |
|---|---|---|
| 7 | deepseek_r1_distill_qwen_32b | 0.379 |
| 22 | qwen3-8b | 0.175 |
These are the 10 problems with the lowest correlation with the overall evaluation (i.e. the problems on which better models gain the least, or even tend to do worse).
| example_link | pass1_of_ex | tau |
|---|---|---|
| 22 | 0.003 | 0.116 |
| 7 | 0.002 | 0.203 |
| 4 | 0.005 | 0.226 |
| 8 | 0.010 | 0.288 |
| 25 | 0.026 | 0.296 |
| 21 | 0.005 | 0.306 |
| 13 | 0.008 | 0.329 |
| 12 | 0.010 | 0.335 |
| 19 | 0.076 | 0.389 |
| 16 | 0.073 | 0.392 |
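The report does not say how `tau` is computed. One plausible reading, assumed here, is Kendall's tau between each model's overall score and its pass@1 on the given problem. The sketch below implements the tau-b variant (which handles ties) in plain Python; the function name and the toy numbers are illustrative only.

```python
import math


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: contributes to neither term
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x_only) * (untied + ties_y_only))
    if denom == 0:
        return float("nan")  # undefined when one variable is constant
    return (concordant - discordant) / denom


# Toy data: overall score per model, and pass@1 of the same models on one problem.
overall_score = [0.1, 0.2, 0.3, 0.4]
pass1_on_problem = [0.0, 0.0, 0.5, 1.0]
tau = kendall_tau_b(overall_score, pass1_on_problem)
```

Under this reading, a tau near 1 means the problem ranks models the same way the overall evaluation does, a tau near 0 means the problem carries little ranking signal, and a negative tau means better models tend to do worse on it.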
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.