There is 1 example not solved by any model.
If these are good problems, solving some of them can be a good signal that your model is indeed better than the leading models.
25
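
As a minimal sketch of how such a list could be derived, assume each model's results are available as a mapping from example id to pass@1, and that "not solved" means pass@1 is exactly 0 for every model. The function name `unsolved_examples`, the data layout, and the sample values are hypothetical and not taken from the report's tooling.

```python
def unsolved_examples(pass1_by_model):
    """Return the sorted example ids whose pass@1 is 0 for every model.

    pass1_by_model: dict mapping model name -> dict of example id -> pass@1.
    An example missing from a model's results is treated as unsolved by it
    (an assumption made for this sketch).
    """
    example_ids = set()
    for scores in pass1_by_model.values():
        example_ids.update(scores)
    return sorted(
        ex
        for ex in example_ids
        if all(scores.get(ex, 0.0) == 0.0 for scores in pass1_by_model.values())
    )


# Hypothetical data for illustration only.
pass1_by_model = {
    "model_a": {1: 0.50, 2: 0.00, 3: 0.00},
    "model_b": {1: 0.90, 2: 0.10, 3: 0.00},
}
print(unsolved_examples(pass1_by_model))
```
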
| example_link | model | min_pass1_of_model |
|---|---|---|
| 10 | llama-3.2-3B-instruct | 0.008 |
These are the 10 problems with the lowest correlation with the overall evaluation (a negative tau means that better models tend to do worse on that problem).
| example_link | pass1_of_ex | tau |
|---|---|---|
| 17 | 0.000 | -0.204 |
| 9 | 0.005 | -0.126 |
| 6 | 0.000 | -0.032 |
| 10 | 0.000 | -0.012 |
| 27 | 0.000 | 0.039 |
| 14 | 0.000 | 0.130 |
| 13 | 0.004 | 0.161 |
| 29 | 0.003 | 0.174 |
| 22 | 0.001 | 0.179 |
| 26 | 0.014 | 0.302 |
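
The report does not say how `tau` is computed. The sketch below assumes it is Kendall's tau (the tau-b variant, which corrects for ties) between each model's pass@1 on one problem and that model's overall score. The function name `kendall_tau_b` and the sample numbers are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences of numbers."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither factor
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    # tau is undefined when one variable is constant (e.g. pass@1 is 0 everywhere)
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical data: overall score of four models, and their pass@1 on one problem.
overall_score = [0.20, 0.35, 0.50, 0.70]
problem_pass1 = [0.04, 0.03, 0.02, 0.01]  # stronger models do worse here
print(kendall_tau_b(overall_score, problem_pass1))
```

A problem whose pass@1 falls as the overall score rises gets a tau near -1. A problem that no model solves has a constant pass@1, so tau is undefined under this definition, and the sketch returns NaN.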
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.