There is 1 example not solved by any model.
Solving it can be a good signal that your model is indeed better than the leading models, provided it is a sound problem.
1042
| example_link | model | min_pass1_of_model |
|---|---|---|
| 782 | qwen2.5-coder-14b-instruct | 0.810 |
| 823 | qwen2-1.5b-instruct | 0.205 |
| 454 | qwen2.5-coder-0.5b-instruct | 0.098 |
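
The report does not spell out how `min_pass1_of_model` is derived. One plausible reading, offered here only as an assumption, is that for each example it records the weakest model (by overall pass@1) that still solves the example, with examples that no model solves listed separately. A minimal sketch of that reading follows; the function name, the input layout, and the toy model names are hypothetical and not taken from the report.

```python
def weakest_solver(solved, overall_pass1):
    """Return (weakest, unsolved) under the assumed reading of the table.

    solved:        dict example_id -> dict model_name -> bool (did the model solve it?)
    overall_pass1: dict model_name -> overall pass@1 of that model
    weakest:       dict example_id -> (model_name, overall pass@1) for the
                   lowest-scoring model that solves the example
    unsolved:      list of example ids that no model solves
    """
    weakest = {}
    unsolved = []
    for example, by_model in solved.items():
        solvers = [m for m, ok in by_model.items() if ok]
        if not solvers:
            unsolved.append(example)
            continue
        model = min(solvers, key=lambda m: overall_pass1[m])
        weakest[example] = (model, overall_pass1[model])
    return weakest, unsolved


# Toy data with hypothetical model names (not the report's results).
overall = {"model_a": 0.80, "model_b": 0.20, "model_c": 0.10}
solved = {
    1: {"model_a": True, "model_b": False, "model_c": False},
    2: {"model_a": True, "model_b": True, "model_c": False},
    3: {"model_a": False, "model_b": False, "model_c": False},
}
weakest, unsolved = weakest_solver(solved, overall)
print(weakest)
print(unsolved)
```

Under this reading, a high value in the last column marks an example that only strong models solve, while a low value marks one that even a weak model can solve.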
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
| example_link | pass1_of_ex | tau |
|---|---|---|
| 749 | 0.033 | -0.284 |
| 454 | 0.004 | -0.173 |
| 823 | 0.005 | -0.158 |
| 962 | 0.068 | -0.122 |
| 1048 | 0.024 | -0.112 |
| 403 | 0.072 | -0.071 |
| 119 | 0.038 | -0.057 |
| 1016 | 0.086 | -0.056 |
| 952 | 0.073 | -0.045 |
| 12 | 0.049 | -0.023 |
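
The `tau` column is presumably a Kendall rank correlation between how models score on a single problem and how they score overall, although the report does not say which variant it uses. The sketch below assumes Kendall's tau-b and implements it in plain Python; the function name and the toy scores are illustrative only.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: contributes to neither term
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Toy data (not from the report): four models ordered by overall pass@1,
# and their pass@1 on one problem that weaker models solve more often.
overall_pass1 = [0.10, 0.30, 0.50, 0.80]
problem_pass1 = [0.20, 0.10, 0.05, 0.00]
tau = kendall_tau_b(overall_pass1, problem_pass1)
print(tau)
```

A negative value, as in the table above, means that models with higher overall scores tend to score lower on that problem, which is why such problems are worth inspecting for ambiguous statements or faulty tests.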
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.