There are 14 examples not solved by any model.
If these are good problems, solving some of them is a strong signal that your model is indeed better than the leading models.
build-initramfs-qemu, extract-moves-from-video, gpt2-codegolf, intrusion-detection, play-zork, polyglot-c-py, qemu-alpine-ssh, raman-fitting, raman-fitting.easy, reshard-c4-data, run-pdp11-code, super-benchmark-upet, train-fasttext, write-compressor
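
A list like the one above can be derived from per-model results. The following is a minimal sketch, assuming the results are available as a nested mapping from model name to per-example pass@1; the function name `unsolved_examples` and the toy data are hypothetical and are not taken from the report.

```python
def unsolved_examples(pass1):
    """Return examples whose pass@1 is 0 for every model.

    `pass1` maps model name -> {example name -> pass@1 in [0, 1]}.
    This layout is an assumption made for illustration only.
    An example missing from a model's results is treated as unsolved by it.
    """
    examples = set()
    for per_example in pass1.values():
        examples.update(per_example)
    return sorted(
        ex
        for ex in examples
        if all(scores.get(ex, 0.0) == 0.0 for scores in pass1.values())
    )


# Hypothetical toy data, not real benchmark results.
toy = {
    "model_a": {"hello-world": 1.0, "play-zork": 0.0, "write-compressor": 0.0},
    "model_b": {"hello-world": 0.8, "play-zork": 0.0, "write-compressor": 0.2},
}
print(unsolved_examples(toy))  # ['play-zork']
```

Treating a missing result as unsolved is a design choice; a stricter version could skip examples that were not run by every model.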
| example_link | model | min_pass1_of_model |
|---|---|---|
| build-linux-kernel-qemu | 20251016_Chaterm_claude-4-5-sonnet | 0.637 |
These are the 10 problems with the lowest correlation (tau) with the overall evaluation. A negative tau means that better models tend to do worse on that problem; only the first four rows below are negative.
| example_link | pass1_of_ex | tau |
|---|---|---|
| download-youtube | 0.064 | -0.248 |
| git-multibranch | 0.048 | -0.161 |
| path-tracing-reverse | 0.040 | -0.076 |
| configure-git-webserver | 0.384 | -0.015 |
| polyglot-rust-c | 0.016 | 0.163 |
| cron-broken-network | 0.176 | 0.168 |
| jupyter-notebook-server | 0.176 | 0.187 |
| hello-world | 0.928 | 0.188 |
| simple-sheets-put | 0.824 | 0.238 |
| sanitize-git-repo | 0.328 | 0.239 |
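
As a rough illustration of how a per-problem tau like the one in the table might be computed, the sketch below implements Kendall's tau-b in plain Python and applies it to per-model scores. Both the choice of Kendall's tau-b and the use of one overall score per model are assumptions: the report only labels the column `tau`. The function name `kendall_tau_b` and the numbers are hypothetical.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: counted in neither term
            if dx == 0:
                ties_x += 1  # tied only in x
            elif dy == 0:
                ties_y += 1  # tied only in y
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical data: one entry per model, in the same model order.
overall_score = [0.2, 0.4, 0.6, 0.8]   # assumed overall evaluation per model
problem_pass1 = [0.9, 0.7, 0.5, 0.1]   # assumed pass@1 on a single problem

# Better models do strictly worse here, so tau is -1.0.
print(kendall_tau_b(overall_score, problem_pass1))
```

With only a handful of models, or with many problems at pass@1 near 0 or 1, ties dominate and tau becomes noisy, so small positive or negative values should be read with caution.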
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.