There are 10 examples not solved by any model.
Provided these are well-posed problems, solving some of them is a good signal that your model is indeed better than the leading models.
130, 139, 140, 145, 28, 32, 4, 51, 83, 86
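As a rough illustration, a list like the one above could be derived from a per-model results table. The sketch below assumes that "not solved" means a pass@1 of exactly 0 for every model, which the report does not state explicitly; the function name `unsolved_examples` and the toy data are hypothetical.

```python
def unsolved_examples(pass1):
    """Return sorted IDs of examples whose pass@1 is 0 for every model.

    pass1: dict mapping model name -> dict mapping example ID -> pass@1 in [0, 1].
    Assumption: an example missing from a model's results counts as unsolved by it.
    """
    # Collect every example ID seen under any model.
    example_ids = set()
    for per_example in pass1.values():
        example_ids.update(per_example)

    # Keep the examples on which no model has a nonzero pass@1.
    return sorted(
        ex
        for ex in example_ids
        if all(scores.get(ex, 0.0) == 0.0 for scores in pass1.values())
    )


# Hypothetical toy data, not taken from the report.
toy_pass1 = {
    "model_a": {1: 1.0, 2: 0.0, 3: 0.0},
    "model_b": {1: 0.5, 2: 0.0, 3: 0.2},
}
print(unsolved_examples(toy_pass1))
```

On the toy data only example 2 is unsolved, since example 3 has a nonzero pass@1 under `model_b`.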
| example_link | model | min_pass1_of_model |
|---|---|---|
| 163 | mistralai_mixtral_8x22b_instruct_v0.1 | 0.656 |
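These are the 10 problems with the lowest correlation (tau) with the overall evaluation. A negative tau means that better models tend to do worse on the problem; a small positive tau means that the problem only weakly tracks overall model quality.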
| example_link | pass1_of_ex | tau |
|---|---|---|
| 21 | 0.012 | -0.047 |
| 154 | 0.131 | -0.031 |
| 160 | 0.082 | 0.109 |
| 132 | 0.005 | 0.127 |
| 163 | 0.002 | 0.127 |
| 22 | 0.074 | 0.142 |
| 91 | 0.021 | 0.229 |
| 134 | 0.053 | 0.244 |
| 127 | 0.138 | 0.250 |
| 54 | 0.414 | 0.258 |
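For readers who want to reproduce a per-problem correlation like the `tau` column, here is a minimal sketch. It assumes that `tau` is Kendall's rank correlation (the tau-b variant) between each model's overall score and its pass@1 on the single problem; the report does not say which statistic it uses, so treat this as one plausible reading. The function name and the toy numbers are hypothetical.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences of numbers.

    Returns 0.0 when the statistic is undefined (for example, all values
    in x or in y are tied); that fallback is a choice made for this sketch.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both sequences: excluded from every count
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1

    # Pairs untied in x are C + D + only_y_tied; pairs untied in y are
    # C + D + only_x_tied. Tau-b divides by the geometric mean of the two.
    untied_in_x = concordant + discordant + only_y_tied
    untied_in_y = concordant + discordant + only_x_tied
    denom = math.sqrt(untied_in_x * untied_in_y)
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical toy data: three models ordered by overall score.
overall_score = [0.2, 0.5, 0.8]
pass1_on_problem = [0.9, 0.4, 0.1]  # better models do worse here
print(kendall_tau_b(overall_score, pass1_on_problem))
```

On the toy data every pair of models is ordered oppositely by the two quantities, so the result is -1.0. Problems with very low pass@1 tend to produce many ties, which is one reason their tau values sit close to zero.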
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.