There are 9 examples not solved by any model.
Solving some of them can be a good signal that your model is better than the leading models, provided these are good problems.
43, 96, 110, 154, 240, 286, 306, 308, 422
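A list like the one above can be derived from a per-model results table. The sketch below is a minimal illustration, not the report's actual code: the `results[model][example_id]` layout, the function name `unsolved_examples`, and the toy values are all assumptions, and "not solved" is taken to mean a pass@1 of exactly 0 for every model.

```python
# Hypothetical layout: results[model][example_id] = pass@1 in [0, 1].
def unsolved_examples(results):
    """Return the sorted example ids whose pass@1 is 0 for every model."""
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)
    return sorted(
        ex
        for ex in example_ids
        # A missing entry is treated as unsolved (an assumption of this sketch).
        if all(per_example.get(ex, 0.0) == 0.0 for per_example in results.values())
    )


# Toy data, not taken from the report.
toy_results = {
    "model_a": {1: 0.5, 2: 0.0, 3: 0.0},
    "model_b": {1: 0.0, 2: 0.2, 3: 0.0},
}
print(unsolved_examples(toy_results))
```

Sorting the ids as integers rather than strings keeps them in numeric order.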
| example_link | model | min_pass1_of_model |
|---|---|---|
| 264 | qwen3-14b | 0.807 |
| 383 | qwen1.5-14b-chat | 0.283 |
| 305 | mistralai_mixtral_8x7b_instruct_v0.1 | 0.233 |
These are the 10 problems with the lowest correlation (tau) with the overall evaluation; a negative tau means that better models tend to do worse on that problem.
| example_link | pass1_of_ex | tau |
|---|---|---|
| 204 | 0.096 | -0.276 |
| 305 | 0.007 | -0.079 |
| 383 | 0.007 | -0.063 |
| 94 | 0.046 | -0.002 |
| 456 | 0.184 | 0.014 |
| 420 | 0.126 | 0.103 |
| 340 | 0.064 | 0.104 |
| 460 | 0.070 | 0.128 |
| 485 | 0.062 | 0.167 |
| 372 | 0.051 | 0.169 |
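One way a per-problem correlation like the tau column could be computed is sketched below. It assumes tau is Kendall's tau, taken across models, between each model's overall accuracy and its pass@1 on the problem. The report does not state which variant it uses, so the tie-corrected tau-b is an assumption here, as are the helper names and the toy data.

```python
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pairs)."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: excluded from both denominator terms
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else 0.0


def per_problem_tau(results):
    """results[model][example_id] = pass@1 (hypothetical layout)."""
    models = sorted(results)
    # Overall score of a model: mean pass@1 over its examples (an assumption).
    overall = [sum(results[m].values()) / len(results[m]) for m in models]
    examples = sorted(results[models[0]])
    return {
        ex: kendall_tau_b(overall, [results[m][ex] for m in models])
        for ex in examples
    }


# Toy data, not taken from the report: problem 2 gets harder for better models.
toy = {
    "weak": {1: 0.1, 2: 0.3, 3: 0.0},
    "mid": {1: 0.5, 2: 0.2, 3: 0.5},
    "strong": {1: 0.9, 2: 0.1, 3: 1.0},
}
taus = per_problem_tau(toy)
print(taus)
```

Sorting the resulting dictionary by value in ascending order and keeping the first 10 entries would give a table of the same shape as the one above.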
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.
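As a rough sketch of how the accuracy histogram could be binned, the snippet below counts problems per accuracy bucket. The choice of 10 equal-width bins and the function name are assumptions. The input values are only the `pass1_of_ex` entries of the 10 low-correlation problems listed above, not the full benchmark.

```python
def accuracy_histogram(accuracies, num_bins=10):
    """Count values in [0, 1] per equal-width bin; 1.0 falls in the last bin."""
    counts = [0] * num_bins
    for acc in accuracies:
        index = min(int(acc * num_bins), num_bins - 1)
        counts[index] += 1
    return counts


# pass1_of_ex values of the 10 low-correlation problems in the table above.
pass1 = [0.096, 0.007, 0.007, 0.046, 0.184, 0.126, 0.064, 0.070, 0.062, 0.051]
hist = accuracy_histogram(pass1)
for i, count in enumerate(hist):
    if count:
        print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f}): {count}")
```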