There are 11 examples not solved by any model.
If these are good problems, solving some of them is a good signal that your model is indeed better than the leading models.
7, 8, 9, 10, 11, 21, 22, 23, 25, 28, 29
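
A list like the one above could be derived from per-model results roughly as follows. This is a minimal sketch, not the report's actual pipeline: the function name `unsolved_examples`, the nested-dictionary layout, and the rule that "not solved" means a pass@1 of exactly 0 for every model are all assumptions made for illustration.

```python
from typing import Dict, List


def unsolved_examples(pass1: Dict[str, Dict[int, float]]) -> List[int]:
    """Return the IDs of examples that no model solves, sorted numerically.

    `pass1` maps model name -> {example id -> pass@1}. A missing entry is
    treated as 0.0 (an assumption, not something the report states).
    """
    all_ids = set()
    for per_example in pass1.values():
        all_ids.update(per_example)
    return sorted(
        ex
        for ex in all_ids
        if all(scores.get(ex, 0.0) == 0.0 for scores in pass1.values())
    )


# Hypothetical toy data, not taken from the report.
toy = {
    "model_a": {7: 0.0, 8: 0.0, 13: 0.5},
    "model_b": {7: 0.0, 8: 0.25, 13: 0.0},
}
print(unsolved_examples(toy))  # prints [7]
```

Sorting integer IDs rather than their string forms keeps the list in numeric order, so 7 comes before 10.
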
| example_link | model | min_pass1_of_model |
|---|---|---|
| 13 | deepseek_r1_distill_llama_70b | 0.383 |
| 4 | mistralai_mixtral_8x7b_instruct_v0.1 | 0.022 |
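
One plausible reading of the `min_pass1_of_model` column is "the overall pass@1 of the weakest model that still solves this example". The report does not define the column, so the sketch below is only an illustration of that reading; `weakest_solver`, its arguments, and the toy numbers are hypothetical.

```python
from typing import Dict, Optional, Tuple


def weakest_solver(
    example_id: int,
    per_example: Dict[str, Dict[int, float]],
    overall: Dict[str, float],
) -> Optional[Tuple[str, float]]:
    """Return (model, overall pass@1) for the lowest-scoring model that
    solves `example_id`, or None if no model solves it.

    "Solves" is assumed to mean a per-example pass@1 greater than 0.
    """
    solvers = [
        model
        for model, scores in per_example.items()
        if scores.get(example_id, 0.0) > 0.0
    ]
    if not solvers:
        return None
    weakest = min(solvers, key=lambda model: overall[model])
    return weakest, overall[weakest]


# Hypothetical toy data, not taken from the report.
overall = {"strong_model": 0.40, "weak_model": 0.05}
per_example = {
    "strong_model": {1: 1.0, 2: 0.5},
    "weak_model": {1: 0.2, 2: 0.0},
}
print(weakest_solver(1, per_example, overall))  # prints ('weak_model', 0.05)
print(weakest_solver(2, per_example, overall))  # prints ('strong_model', 0.4)
```

Under this reading, a high value means only strong models solve the example, while a low value means even a weak model gets it right at least occasionally.
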
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., problems on which better models have the smallest advantage, or even tend to do worse).
| example_link | pass1_of_ex | tau |
|---|---|---|
| 4 | 0.007 | 0.008 |
| 13 | 0.010 | 0.210 |
| 26 | 0.075 | 0.241 |
| 12 | 0.020 | 0.294 |
| 19 | 0.069 | 0.311 |
| 24 | 0.021 | 0.320 |
| 16 | 0.042 | 0.384 |
| 27 | 0.036 | 0.415 |
| 17 | 0.060 | 0.445 |
| 14 | 0.090 | 0.525 |
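
The `tau` column is not defined in the report. Assuming it is a Kendall rank correlation between each model's overall score and its pass@1 on the given problem, a per-problem value could be computed along these lines. The pure-Python `kendall_tau_b` below is an illustrative sketch (the tau-b variant, which corrects for ties), not necessarily the statistic the report uses.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b between two equal-length sequences.

    Returns NaN when either sequence is constant, because the
    correlation is undefined in that case.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither factor
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical toy data: overall score per model, and each model's pass@1
# on two made-up problems.
overall_scores = [0.1, 0.2, 0.3, 0.4]
aligned_problem = [0.0, 0.1, 0.2, 0.9]   # better models do better
inverted_problem = [0.9, 0.2, 0.1, 0.0]  # better models do worse

print(kendall_tau_b(overall_scores, aligned_problem))   # prints 1.0
print(kendall_tau_b(overall_scores, inverted_problem))  # prints -1.0
```

A value near 1 means the problem ranks models the same way the overall evaluation does, a value near 0 means the problem carries little information about overall quality, and a negative value means better models tend to do worse on it.
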
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.