There are 5 examples not solved by any model.
If these are good problems, solving some of them can be a good signal that your model is indeed better than the leading models.
129, 218, 220, 501, 620
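
A list like the one above can be derived from a per-model, per-example results table. The following is a minimal sketch, assuming the results are available as a nested mapping from model name to example id to pass@1; the name `unsolved_examples`, the data layout, and the toy values are hypothetical and not taken from this report.

```python
from typing import Dict, List


def unsolved_examples(pass1: Dict[str, Dict[int, float]]) -> List[int]:
    """Return the ids of examples whose pass@1 is 0 for every model.

    `pass1` maps model name -> {example id -> pass@1}. This layout is an
    assumption for illustration; a missing entry is treated as unsolved.
    """
    all_ids = set()
    for per_example in pass1.values():
        all_ids.update(per_example)
    return sorted(
        ex
        for ex in all_ids
        if all(per_example.get(ex, 0.0) == 0.0 for per_example in pass1.values())
    )


# Toy data, invented for illustration only.
toy = {
    "model_a": {1: 1.0, 2: 0.0, 3: 0.0},
    "model_b": {1: 0.0, 2: 0.5, 3: 0.0},
}
print(unsolved_examples(toy))  # [3]
```

Treating a missing entry as unsolved is a design choice of this sketch; if a model was simply not run on an example, excluding it from the check may be more appropriate.
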
| example_link | model | min_pass1_of_model |
|---|---|---|
| 229 | qwen2.5-coder-32b-instruct | 0.761 |
| 250 | qwen2-72b-instruct | 0.569 |
| 413 | qwen2.5-coder-7b-instruct | 0.549 |
| 444 | qwen2.5-coder-3b-instruct | 0.433 |
| 729 | qwen1.5-32b-chat | 0.414 |
| 112 | qwen2-math-1.5b-instruct | 0.169 |
| 177 | qwen2-1.5b-instruct | 0.151 |
| 185 | qwen1.5-0.5b-chat | 0.017 |
| 128 | qwen1.5-0.5b-chat | 0.017 |
These are the 10 problems with the lowest correlation (tau) with the overall evaluation (i.e., better models tend to do worse on these).
| example_link | pass1_of_ex | tau |
|---|---|---|
| 236 | 0.006 | -0.186 |
| 128 | 0.001 | -0.181 |
| 185 | 0.001 | -0.181 |
| 320 | 0.003 | -0.161 |
| 113 | 0.004 | -0.161 |
| 177 | 0.002 | -0.142 |
| 438 | 0.086 | -0.136 |
| 112 | 0.002 | -0.135 |
| 687 | 0.003 | -0.123 |
| 273 | 0.132 | -0.119 |
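
The `tau` column can be read as a rank correlation between how models score on one problem and how they score overall. The sketch below assumes `tau` is Kendall's tau (the report does not say which variant it uses) and implements the tie-adjusted tau-b in plain Python; the function name and the toy scores are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b between two equal-length sequences.

    Assumed usage: x holds each model's overall score and y holds the same
    models' pass@1 on a single problem. Returns NaN when either sequence
    is constant, since the correlation is undefined in that case.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    if denom == 0:
        return float("nan")
    return (concordant - discordant) / denom


# Toy data, invented for illustration: weaker models solve the problem,
# stronger ones do not, so tau comes out negative.
overall = [0.2, 0.4, 0.6, 0.8]
on_problem = [1.0, 1.0, 0.0, 0.0]
print(kendall_tau_b(overall, on_problem))
```

A negative value, as in every row of the table above, means that the models ranked higher overall tend to rank lower on that problem.
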
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.