There are 8 examples that no model solved.
If these are well-posed problems, solving some of them is a good signal that your model is indeed better than the leading models.
11801, 3265, 5029, 5528, 5529, 5724, 9319, 9551
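
As a rough illustration of how such a list could be derived, the sketch below scans a per-model, per-example score table and keeps the examples on which every model scores zero. The `pass1` layout (model name mapped to example ID mapped to pass@1), the function name `unsolved_examples`, and the toy data are all assumptions made for this example; the report does not describe its own data format.

```python
def unsolved_examples(pass1):
    """Return sorted IDs of examples on which every model has pass@1 == 0.

    `pass1` is a hypothetical layout: {model_name: {example_id: score}}.
    A missing score is treated as a failure (0.0), which is an assumption.
    """
    example_ids = set()
    for scores in pass1.values():
        example_ids.update(scores)
    return sorted(
        ex for ex in example_ids
        if all(scores.get(ex, 0.0) == 0.0 for scores in pass1.values())
    )


# Toy data, not taken from the report.
pass1 = {
    "model_a": {1: 1.0, 2: 0.0, 3: 0.0},
    "model_b": {1: 0.0, 2: 0.5, 3: 0.0},
}
print(unsolved_examples(pass1))  # [3]
```

Whether an unsolved example is a useful signal still depends on checking by hand that it is not mislabeled or ambiguous.
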
| example_link | model | min_pass1_of_model |
|---|---|---|
| 5706 | qwen2-math-72b-instruct | 0.511 |
| 11637 | qwen2.5-coder-7b-instruct | 0.377 |
| 6355 | qwen2-math-7b-instruct | 0.331 |
| 1325 | qwen2-math-1.5b-instruct | 0.251 |
| 6663 | qwen2-math-1.5b-instruct | 0.251 |
| 5324 | qwen3-0.6b | 0.238 |
| 1402 | qwen3-0.6b | 0.238 |
| 6232 | mistralai_mistral_7b_instruct_v0.1 | 0.238 |
| 9571 | mistralai_mistral_7b_instruct_v0.1 | 0.238 |
| 5367 | deepseek_r1_distill_qwen_1.5b | 0.205 |
| 1468 | deepseek_r1_distill_qwen_1.5b | 0.205 |
| 1789 | qwen2.5-coder-1.5b-instruct | 0.203 |
| 9649 | qwen2-1.5b-instruct | 0.172 |
| 7254 | qwen2-1.5b-instruct | 0.172 |
| 11641 | qwen1.5-1.8b-chat | 0.124 |
| 11422 | qwen2.5-coder-0.5b-instruct | 0.104 |
| 3282 | qwen2.5-coder-0.5b-instruct | 0.104 |
| 3519 | qwen2.5-coder-0.5b-instruct | 0.104 |
| 8851 | qwen2.5-coder-0.5b-instruct | 0.104 |
| 5081 | qwen2.5-coder-0.5b-instruct | 0.104 |
| 2677 | qwen1.5-0.5b-chat | 0.103 |
| 4202 | qwen1.5-0.5b-chat | 0.103 |
| 7386 | qwen1.5-0.5b-chat | 0.103 |
| 8042 | qwen1.5-0.5b-chat | 0.103 |
| 9880 | qwen1.5-0.5b-chat | 0.103 |
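
One plausible reading of the table above is that each row names the weakest model (by overall pass@1) that still solves the example, so a high `min_pass1_of_model` marks a problem only strong models get right. The report does not state this definition, so the sketch below is an assumption: the `pass1` layout, the name `weakest_solver`, and the rule that "solves" means a score above zero are all hypothetical.

```python
def weakest_solver(pass1):
    """Map each solved example to (model, overall_pass1) of its weakest solver.

    `pass1` is a hypothetical layout: {model_name: {example_id: score}}.
    "Solved" is assumed to mean score > 0; overall pass@1 is the plain mean.
    Examples that no model solves are left out of the result.
    """
    overall = {m: sum(s.values()) / len(s) for m, s in pass1.items() if s}
    result = {}
    for model, scores in pass1.items():
        for ex, score in scores.items():
            if score <= 0:
                continue
            if ex not in result or overall[model] < result[ex][1]:
                result[ex] = (model, overall[model])
    return result


# Toy data, not taken from the report.
pass1 = {
    "strong": {1: 1.0, 2: 1.0, 3: 1.0, 4: 0.0},  # overall 0.75
    "weak": {1: 1.0, 2: 0.0, 3: 0.0, 4: 0.0},    # overall 0.25
}
solvers = weakest_solver(pass1)
# Sort hardest first, mirroring the descending order of the table.
for ex, (model, score) in sorted(solvers.items(), key=lambda kv: -kv[1][1]):
    print(ex, model, score)
```
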
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
| example_link | pass1_of_ex | tau |
|---|---|---|
| 10838 | 0.096 | -0.605 |
| 4881 | 0.087 | -0.563 |
| 5495 | 0.158 | -0.563 |
| 9882 | 0.080 | -0.552 |
| 6377 | 0.045 | -0.543 |
| 3730 | 0.083 | -0.539 |
| 8526 | 0.090 | -0.537 |
| 5889 | 0.069 | -0.526 |
| 7595 | 0.027 | -0.519 |
| 91 | 0.131 | -0.517 |
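
The column name `tau` suggests a rank correlation such as Kendall's tau, though the report does not say which variant is used. Under that assumption, a per-problem value could be computed by correlating each model's overall score with its score on that one problem. The following is a minimal pure-Python sketch of Kendall's tau-b; the function name and the toy numbers are illustrative only.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: contributes to neither factor
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied_x = concordant + discordant + tied_y_only  # pairs not tied in x
    untied_y = concordant + discordant + tied_x_only  # pairs not tied in y
    denom = sqrt(untied_x * untied_y)
    # Tau is undefined when one input is constant; returning 0.0 is a convention.
    return (concordant - discordant) / denom if denom else 0.0


# Toy data: four models, ordered by overall accuracy, and their scores on one problem.
overall_accuracy = [0.1, 0.3, 0.5, 0.7]
problem_pass1 = [1.0, 0.5, 0.5, 0.0]
tau = kendall_tau_b(overall_accuracy, problem_pass1)
print(round(tau, 3))  # negative: stronger models do worse on this problem
```

A strongly negative value like those in the table is often worth a manual look, since it can point to a wrong reference answer or an ambiguous prompt rather than a genuinely hard problem.
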
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.