There are 15 examples not solved by any model.
If these problems are well-posed, solving any of them is a good signal that your model is indeed better than the leading models.
1, 4, 6, 9, 10, 11, 13, 14, 17, 21, 22, 23, 24, 25, 27
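
As an illustration, a list like the one above could be derived from per-model results roughly as follows. This is a minimal sketch, not the report's actual code: the `results` layout, the model names, and the `unsolved_examples` helper are all hypothetical, and "not solved" is assumed to mean a pass@1 of exactly 0 for every model.

```python
# Hypothetical layout: results[model][example_id] = pass@1 of that model on that example.
results = {
    "model_a": {1: 0.0, 2: 0.4, 3: 0.0},
    "model_b": {1: 0.0, 2: 0.0, 3: 0.1},
}


def unsolved_examples(results):
    """Return the sorted ids of examples whose pass@1 is 0 for every model."""
    example_ids = set()
    for scores in results.values():
        example_ids.update(scores)
    return sorted(
        ex
        for ex in example_ids
        # A missing entry is treated as "not solved" (an assumption of this sketch).
        if all(scores.get(ex, 0.0) == 0.0 for scores in results.values())
    )


print(unsolved_examples(results))
```

Sorting the ids as integers also avoids the lexicographic ordering that string ids would produce.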
| example_link | model | min_pass1_of_model |
|---|---|---|
| 26 | deepseek_r1_distill_llama_70b | 0.267 |
| 28 | google_codegemma_1.1_7b_it | 0.013 |
| 12 | mistralai_mixtral_8x7b_instruct_v0.1 | 0.011 |
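
One plausible reading of the `min_pass1_of_model` column is "the lowest overall pass@1 among the models that solve this example", reported together with the model that attains it. The sketch below computes that quantity under this assumption. The data layout, the model names, and the `weakest_solver` helper are hypothetical and are not taken from the report.

```python
def weakest_solver(per_example, overall):
    """Map each solved example to (model, overall pass@1) of its weakest solver.

    per_example[model][example_id] is that model's pass@1 on the example;
    overall[model] is the model's overall pass@1. Both layouts are assumptions.
    """
    best = {}
    for model, scores in per_example.items():
        for ex, p in scores.items():
            if p <= 0:
                continue  # this model does not solve the example
            if ex not in best or overall[model] < best[ex][1]:
                best[ex] = (model, overall[model])
    return best


# Hypothetical data for illustration only.
per_example = {
    "strong_model": {26: 0.3, 12: 0.0},
    "weak_model": {26: 0.1, 12: 0.2},
}
overall = {"strong_model": 0.6, "weak_model": 0.05}

for ex, (model, score) in sorted(weakest_solver(per_example, overall).items()):
    print(f"| {ex} | {model} | {score:.3f} |")
```

A low value in this column means that even a weak model solves the example, so failing it is more notable than failing an example that only strong models solve.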
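These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e. the problems on which better models have the least advantage. A negative tau would mean that better models tend to do worse on that problem.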
| example_link | pass1_of_ex | tau |
|---|---|---|
| 12 | 0.006 | 0.017 |
| 28 | 0.008 | 0.030 |
| 19 | 0.019 | 0.099 |
| 29 | 0.029 | 0.154 |
| 26 | 0.010 | 0.221 |
| 8 | 0.068 | 0.380 |
| 20 | 0.122 | 0.485 |
| 18 | 0.101 | 0.505 |
| 7 | 0.130 | 0.540 |
| 15 | 0.122 | 0.545 |
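
To make the `tau` column concrete, the sketch below assumes that `tau` is Kendall's rank correlation between each model's score on one problem and its overall score. The report does not say which variant is used; this example implements tau-b, which corrects for ties. The function name and the sample numbers are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equally long sequences of numbers."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: counted nowhere
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif (dx > 0) == (dy > 0):
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x) * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical scores of four models: overall pass@1 and pass@1 on one problem.
overall_scores = [0.1, 0.2, 0.3, 0.4]
problem_scores = [0.0, 0.0, 0.5, 0.2]
print(round(kendall_tau_b(overall_scores, problem_scores), 3))
```

A value near 1 means the problem ranks models the same way the full benchmark does, a value near 0 means the problem carries little ranking information, and a negative value means stronger models tend to score lower on it.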
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.
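
The bin counts behind histograms like these could be computed as in the following sketch. It assumes values in the range [0, 1] and equal-width bins; the `histogram` helper, the bin count, and the sample values are hypothetical and say nothing about how the original figures were produced.

```python
def histogram(values, n_bins=10, lo=0.0, hi=1.0):
    """Count values per equal-width bin on [lo, hi]; hi falls into the last bin.

    Values are assumed to lie within [lo, hi].
    """
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical per-problem accuracies.
accuracies = [0.0, 0.05, 0.5, 1.0]
print(histogram(accuracies, n_bins=2))
```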