There are 10 examples not solved by any model.
Provided these are good problems, solving some of them is a good signal that your model is indeed better than the leading models.
1128, 1139, 1295, 1350, 1389, 1459, 1499, 1628, 2323, 337
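As a minimal sketch of how such a list could be derived, assume a hypothetical `results` mapping from model name to per-example pass@1 scores, and assume that "not solved" means a pass@1 of exactly 0 for every model. Neither the data layout nor the function name comes from the report itself. The toy data reuses a few IDs from this report purely as labels.

```python
def unsolved_examples(results):
    """Return sorted IDs of examples that no model solves.

    `results` (hypothetical layout) maps model name -> {example_id: pass@1}.
    An example counts as unsolved when every model scores 0 on it.
    """
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)
    return sorted(
        ex
        for ex in example_ids
        if all(scores.get(ex, 0.0) == 0.0 for scores in results.values())
    )


# Toy data, not taken from the report.
results = {
    "model_a": {"1128": 0.0, "1594": 1.0, "255": 0.0},
    "model_b": {"1128": 0.0, "1594": 0.0, "255": 1.0},
}
print(unsolved_examples(results))
```

IDs are kept as strings in this sketch, so they sort lexicographically; that would be consistent with 337 appearing after 2323 in the list above, though the report does not say how the list was ordered.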
| example_link | model | min_pass1_of_model |
|---|---|---|
| 1594 | llama-3.1-70B-instruct | 0.861 |
| 2062 | qwen2.5-coder-14b-instruct | 0.688 |
| 2337 | qwen1.5-32b-chat | 0.592 |
| 1087 | qwen2-7b-instruct | 0.570 |
| 255 | qwen2-7b-instruct | 0.570 |
| 1110 | qwen1.5-7b-chat | 0.319 |
| 1832 | qwen1.5-7b-chat | 0.319 |
| 1284 | mistralai_mistral_7b_instruct_v0.1 | 0.177 |
| 2554 | qwen2-1.5b-instruct | 0.146 |
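The report does not define `min_pass1_of_model`. One plausible reading, assumed here, is that for each example it records the weakest model (by overall pass@1) that still solves the example. The sketch below follows that reading; the `results` layout, the function name, and the rule that pass@1 > 0 counts as solving are all assumptions.

```python
def weakest_solver(results):
    """For each solved example, find the solving model with the lowest overall pass@1.

    `results` (hypothetical layout) maps model name -> {example_id: pass@1}.
    Returns {example_id: (model_name, overall_pass1_of_that_model)}.
    """
    # Overall pass@1 of a model is its mean score over its examples.
    overall = {m: sum(r.values()) / len(r) for m, r in results.items()}
    weakest = {}
    for model, per_example in results.items():
        for ex, score in per_example.items():
            if score <= 0:
                continue  # this model does not solve the example
            if ex not in weakest or overall[model] < weakest[ex][1]:
                weakest[ex] = (model, overall[model])
    return weakest


# Toy data, not taken from the report.
toy = {
    "strong": {"a": 1.0, "b": 1.0, "c": 0.0},
    "weak": {"a": 1.0, "b": 0.0, "c": 0.0},
}
for ex, (model, pass1) in sorted(weakest_solver(toy).items()):
    print(ex, model, round(pass1, 3))
```

Under this reading, an example whose weakest solver has a high overall pass@1 is one that only strong models solve.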
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
| example_link | pass1_of_ex | tau |
|---|---|---|
| 2542 | 0.195 | -0.422 |
| 1327 | 0.048 | -0.410 |
| 2634 | 0.021 | -0.353 |
| 2589 | 0.207 | -0.340 |
| 1339 | 0.133 | -0.337 |
| 2339 | 0.109 | -0.329 |
| 345 | 0.026 | -0.320 |
| 370 | 0.036 | -0.309 |
| 843 | 0.013 | -0.291 |
| 1589 | 0.132 | -0.289 |
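The `tau` column name suggests a Kendall rank correlation, although the report does not say which variant is used. As a sketch under that assumption, the function below computes Kendall's tau-b between each model's overall score and its score on a single example; a negative value means stronger models tend to do worse on that example. The function name and the toy data are illustrative only.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: excluded from every term
            if dx == 0:
                ties_x += 1  # tied only in x
            elif dy == 0:
                ties_y += 1  # tied only in y
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Toy data: four models ordered by overall pass@1, and their scores on one example.
overall_pass1 = [0.2, 0.4, 0.6, 0.8]
example_pass1 = [1.0, 1.0, 0.0, 0.0]  # only the weaker models solve it
tau = kendall_tau_b(overall_pass1, example_pass1)
print(round(tau, 3))
```

Tau-b is chosen here because per-example scores are often tied (many models score exactly 0 or 1), and this variant corrects for ties; a different variant would give somewhat different values.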
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.