There are 3 examples not solved by any model.
If these are good problems, solving some of them is a good signal that your model is indeed better than the leading models.
154, 422, 96
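
A list like the one above could be derived as in the following minimal sketch. It assumes per-model pass@1 scores are available as a nested mapping from model name to example id to score. The `pass1` layout, the function name `unsolved_examples`, and the toy data are all hypothetical and are not taken from this report's tooling.

```python
from typing import Dict, List


def unsolved_examples(pass1: Dict[str, Dict[int, float]]) -> List[int]:
    """Return the example ids on which every model has a pass@1 of 0."""
    # Collect every example id that appears for at least one model.
    all_ids = set()
    for per_example in pass1.values():
        all_ids.update(per_example)
    # Assumption: an example missing from a model's results counts as
    # unsolved by that model (score 0.0).
    return sorted(
        ex
        for ex in all_ids
        if all(scores.get(ex, 0.0) == 0.0 for scores in pass1.values())
    )


# Hypothetical toy data, not taken from the report.
toy = {
    "model_a": {1: 0.5, 2: 0.0, 3: 0.0},
    "model_b": {1: 0.0, 2: 0.0, 3: 0.2},
}
print(unsolved_examples(toy))  # prints [2]
```
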
| example_link | model | min_pass1_of_model |
|---|---|---|
| 284 | qwen3-32b | 0.819 |
| 286 | deepseek_r1_distill_qwen_14b | 0.789 |
| 264 | mistralai_mistral_7b_instruct_v0.2 | 0.101 |

These are the 10 problems with the lowest correlation (tau) with the overall evaluation. A negative tau means that better models tend to do worse on that problem.
| example_link | pass1_of_ex | tau |
|---|---|---|
| 204 | 0.090 | -0.140 |
| 383 | 0.009 | -0.139 |
| 264 | 0.002 | -0.135 |
| 305 | 0.009 | -0.132 |
| 110 | 0.010 | -0.059 |
| 420 | 0.117 | 0.006 |
| 444 | 0.058 | 0.052 |
| 340 | 0.046 | 0.061 |
| 240 | 0.008 | 0.089 |
| 460 | 0.121 | 0.109 |
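
The report does not say how `tau` is computed. As a sketch only, one plausible reading is Kendall's tau between each model's overall score and its pass@1 on a single problem. The function below implements the tau-b variant in plain Python under that assumption. The name `kendall_tau_b` and the toy numbers are hypothetical.

```python
import math
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b rank correlation between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both sequences: excluded from both factors
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    denom = math.sqrt((pairs + only_x_tied) * (pairs + only_y_tied))
    # With no usable pairs the correlation is undefined.
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical toy data: one entry per model, ordered the same way in both lists.
overall_score = [0.2, 0.4, 0.6, 0.8]       # overall benchmark accuracy per model
problem_pass1 = [0.30, 0.20, 0.10, 0.00]   # pass@1 on one problem per model
print(kendall_tau_b(overall_score, problem_pass1))  # prints -1.0
```

In this toy case every stronger model scores lower on the problem, so tau reaches its minimum of -1.0. Values near 0, like several rows in the table above, indicate little relationship in either direction.
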
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.