There are 8 examples not solved by any model: `dna-insert`, `install-windows-3.11`, `make-doom-for-mips`, `make-mips-interpreter`, `regex-chess`, `sam-cell-seg`, `torch-pipeline-parallelism`, `torch-tensor-parallelism`. If these are good problems, then solving some of them is a strong signal that your model really is better than the leading models.
For each solved example below, `model` is the weakest model (by overall pass@1) that solved it, and `min_pass1_of_model` is that model's overall pass@1.

| example_link | model | min_pass1_of_model |
|---|---|---|
| polyglot-c-py | Ante__Gemini-3-Pro-Preview | 0.647 |
| path-tracing | Ante__Gemini-3-Pro-Preview | 0.647 |
| mteb-retrieve | Ante__Gemini-3-Pro-Preview | 0.647 |
| protein-assembly | Ante__Gemini-3-Pro-Preview | 0.647 |
| video-processing | Ante__Gemini-3-Pro-Preview | 0.647 |
| write-compressor | Ante__Gemini-3-Pro-Preview | 0.647 |
| polyglot-rust-c | Ante__Gemini-3-Pro-Preview | 0.647 |
| filter-js-from-html | MAYA__Claude-4.5-sonnet | 0.427 |
| gpt2-codegolf | MAYA__Claude-4.5-sonnet | 0.427 |
| raman-fitting | MAYA__Claude-4.5-sonnet | 0.427 |
| model-extraction-relu-logits | dakou__qwen3-coder-480b | 0.272 |
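One way to derive `min_pass1_of_model` is, for each example, to take the minimum overall pass@1 among the models that solved it. A minimal sketch; the model names, scores, and solved-set data below are hypothetical, not taken from the tables above:

```python
# Overall pass@1 per model (hypothetical values).
overall_pass1 = {
    "model_a": 0.647,
    "model_b": 0.427,
    "model_c": 0.272,
}

# Which examples each model solved at least once (hypothetical).
solved_by = {
    "model_a": {"path-tracing", "mteb-retrieve"},
    "model_b": {"path-tracing", "gpt2-codegolf"},
    "model_c": {"model-extraction-relu-logits"},
}

def min_pass1_of_model(example):
    """Lowest overall pass@1 among models that solved `example`, or None."""
    solvers = [m for m, s in solved_by.items() if example in s]
    return min(overall_pass1[m] for m in solvers) if solvers else None

print(min_pass1_of_model("path-tracing"))  # 0.427: model_b is the weakest solver
```

A high `min_pass1_of_model` then means only strong models ever solved the example.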
These are the 10 problems with the lowest rank correlation (tau) between per-model performance on the problem and overall evaluation score (i.e. better models tend to do worse on these).
| example_link | pass1_of_ex | tau |
|---|---|---|
| model-extraction-relu-logits | 0.086 | -0.535 |
| train-fasttext | 0.200 | -0.461 |
| crack-7z-hash | 0.886 | -0.197 |
| filter-js-from-html | 0.143 | -0.178 |
| gpt2-codegolf | 0.143 | -0.178 |
| raman-fitting | 0.143 | -0.178 |
| configure-git-webserver | 0.400 | -0.159 |
| build-pmars | 0.629 | -0.117 |
| query-optimize | 0.286 | -0.056 |
| rstan-to-pystan | 0.429 | -0.053 |
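The tau column can be reproduced as a rank correlation between each model's overall score and its score on the example. A sketch using the simple tau-a variant (no tie correction; the table may use tau-b, and the data below is hypothetical):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs."""
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical per-model data: overall pass@1 vs. pass@1 on one example.
overall = [0.65, 0.55, 0.43, 0.30, 0.27]   # stronger models first
on_example = [0.0, 0.0, 0.2, 0.4, 0.6]     # weaker models do better here

print(kendall_tau(overall, on_example))  # -0.9
```

A negative tau, as in every row above, means performance on the example anti-correlates with overall model strength.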
Histogram of problems by the accuracy (pass@1) on each problem.
Histogram of problems by the minimum win rate required to solve each problem.
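Histograms like these can be built by binning the per-problem pass@1 values; a sketch with hypothetical values (unsolved problems sit at 0.0):

```python
import numpy as np

# Hypothetical per-problem pass@1 values; 0.0 = unsolved by any model.
pass1 = [0.0, 0.0, 0.086, 0.143, 0.143, 0.2, 0.286, 0.4, 0.429, 0.629, 0.886]

# Five equal-width bins over [0, 1]; counts[i] = problems in bin i.
counts, edges = np.histogram(pass1, bins=5, range=(0.0, 1.0))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:.1f}, {hi:.1f}): {'#' * c}")
```

The same call with `min_pass1_of_model` values in place of accuracy gives the second histogram.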