There are 2 examples not solved by any model: 454, 823.
If these are well-posed problems, solving some of them is a good signal that your model is indeed better than the leading models.
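
As a rough illustration of how such a list could be derived, the sketch below flags examples on which every model scores zero. The data layout, the example IDs, the model names, and the helper `unsolved_examples` are all hypothetical; the report does not say how "not solved" is defined, so a pass@1 of exactly 0 for every model is an assumption.

```python
# Hypothetical data: pass@1 per model for each example ID.
# All values are invented for illustration, not taken from the report.
pass1 = {
    "ex_a": {"model_x": 0.0, "model_y": 0.0},
    "ex_b": {"model_x": 0.4, "model_y": 0.0},
    "ex_c": {"model_x": 0.0, "model_y": 0.9},
}


def unsolved_examples(pass1_by_example):
    """Return the sorted IDs of examples where every model has pass@1 == 0.

    Assumption: "not solved by any model" means all per-model pass@1 are zero.
    """
    return sorted(
        example_id
        for example_id, scores in pass1_by_example.items()
        if all(score == 0.0 for score in scores.values())
    )


print(unsolved_examples(pass1))
```

A stricter or looser threshold (for example, pass@1 below some small epsilon) would change which examples are reported.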
| example_link | model | min_pass1_of_model |
|---|---|---|
| 1042 | google_gemma_2b_it | 0.099 |
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
| example_link | pass1_of_ex | tau |
|---|---|---|
| 749 | 0.063 | -0.373 |
| 962 | 0.084 | -0.253 |
| 1309 | 0.026 | -0.210 |
| 835 | 0.059 | -0.206 |
| 952 | 0.116 | -0.197 |
| 1042 | 0.002 | -0.188 |
| 403 | 0.103 | -0.077 |
| 89 | 0.384 | -0.069 |
| 1048 | 0.021 | -0.068 |
| 368 | 0.013 | -0.045 |
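
One way a per-problem correlation like the `tau` column could be computed is sketched below. It assumes `tau` is Kendall's tau (here the tau-b variant) between each model's pass@1 on a single problem and that model's overall score; the report does not confirm this. The function names, problem IDs, and numbers are hypothetical.

```python
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from every count
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else 0.0


def lowest_tau_problems(pass1_by_problem, overall, k=10):
    """Return the k (problem_id, tau) pairs with the lowest tau, ascending."""
    taus = {
        problem_id: kendall_tau_b(scores, overall)
        for problem_id, scores in pass1_by_problem.items()
    }
    return sorted(taus.items(), key=lambda item: item[1])[:k]


# Hypothetical overall scores for four models, weakest to strongest.
overall = [0.2, 0.4, 0.6, 0.8]
# Hypothetical per-problem pass@1 for the same four models, in the same order.
problems = {
    "p_aligned": [0.0, 0.1, 0.2, 0.3],  # stronger models do better
    "p_inverse": [0.3, 0.2, 0.1, 0.0],  # stronger models do worse
}

for problem_id, tau in lowest_tau_problems(problems, overall):
    print(problem_id, round(tau, 3))
```

A negative value for a problem means that models ranked higher overall tend to score lower on it, which is the pattern the table above highlights.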
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.