terminal-bench-1.0: by examples



Not solved by any model

No model solved the following 14 examples. If these problems are well-posed, solving some of them is a good signal that your model is better than the leading models.
build-initramfs-qemu, extract-moves-from-video, gpt2-codegolf, intrusion-detection, play-zork, polyglot-c-py, qemu-alpine-ssh, raman-fitting, raman-fitting.easy, reshard-c4-data, run-pdp11-code, super-benchmark-upet, train-fasttext, write-compressor

Problems solved by 1 model only

example_link              model                                min_pass1_of_model
build-linux-kernel-qemu   20251016_Chaterm_claude-4-5-sonnet   0.637
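The two lists above can be derived from a per-model, per-problem results table. The following is a minimal sketch, assuming a nested mapping from model to problem to a solved flag; the names `results`, `solver_lists`, and the toy models and problems are hypothetical and not part of the benchmark's tooling.

```python
# Hypothetical toy data: results[model][problem] is True if the model
# solved the problem at least once. Real data would come from the
# benchmark's result files, whose format is not shown here.
results = {
    "model_a": {"p1": True, "p2": False, "p3": False},
    "model_b": {"p1": True, "p2": True, "p3": False},
}

def solver_lists(results):
    """Map each problem to the list of models that solved it."""
    solvers = {}
    for model, per_problem in results.items():
        for problem, solved in per_problem.items():
            solvers.setdefault(problem, [])
            if solved:
                solvers[problem].append(model)
    return solvers

solvers = solver_lists(results)

# Problems that no model solved.
unsolved = sorted(p for p, models in solvers.items() if not models)

# Problems solved by exactly one model, mapped to that model.
solved_by_one = {p: models[0] for p, models in solvers.items() if len(models) == 1}

print(unsolved)
print(solved_by_one)
```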

Suspect problems

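These are the 10 problems whose results correlate least with the overall evaluation, ranked by tau. A negative tau means that better models tend to do worse on the problem; a tau near zero means the problem tracks overall model quality only weakly.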

example_link              pass1_of_ex   tau
download-youtube          0.064         -0.248
git-multibranch           0.048         -0.161
path-tracing-reverse      0.040         -0.076
configure-git-webserver   0.384         -0.015
polyglot-rust-c           0.016         0.163
cron-broken-network       0.176         0.168
jupyter-notebook-server   0.176         0.187
hello-world               0.928         0.188
simple-sheets-put         0.824         0.238
sanitize-git-repo         0.328         0.239
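A per-problem correlation of this kind could be computed as sketched below. This assumes that tau is Kendall's rank correlation (the tau-b variant, which handles ties) between each model's overall score and its pass rate on the problem; the page does not state which variant it uses. The function name `kendall_tau_b` and the toy numbers are hypothetical.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences of numbers."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither count
        elif dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0

# Hypothetical toy data: one entry per model.
overall_score = [0.2, 0.4, 0.6, 0.8]      # overall pass@1 of each model
suspect_problem = [0.3, 0.2, 0.1, 0.0]    # better models do worse here
healthy_problem = [0.0, 0.1, 0.5, 0.9]    # better models do better here

tau_suspect = kendall_tau_b(overall_score, suspect_problem)
tau_healthy = kendall_tau_b(overall_score, healthy_problem)
print(tau_suspect, tau_healthy)
```

Under this reading, a problem like `suspect_problem` would sort to the top of the table, since its pass rate falls as overall model quality rises.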

Histogram of accuracies

Histogram of problems by the accuracy on each problem.

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.
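One way to read this difficulty measure is sketched below. It assumes that the "minimum win rate to solve" a problem is the lowest overall pass@1 among the models that solved it, which is how the `min_pass1_of_model` column appears to be named; that interpretation is an assumption, as are the names `difficulty`, `histogram`, and the toy data.

```python
from collections import Counter

# Hypothetical toy data.
overall_pass1 = {"model_a": 0.35, "model_b": 0.62}  # overall pass@1 per model
solved_by = {
    "p1": ["model_a", "model_b"],
    "p2": ["model_b"],
    "p3": [],  # unsolved problems have no defined difficulty here
}

def difficulty(solved_by, overall_pass1):
    """Lowest overall pass@1 among the models that solved each problem."""
    return {
        problem: min(overall_pass1[m] for m in models)
        for problem, models in solved_by.items()
        if models
    }

def histogram(values, n_bins=10):
    """Count values in [0, 1] into n_bins equal-width bins."""
    # A value of exactly 1.0 is clamped into the last bin.
    counts = Counter(min(int(v * n_bins), n_bins - 1) for v in values)
    return [counts.get(i, 0) for i in range(n_bins)]

diff = difficulty(solved_by, overall_pass1)
bins = histogram(diff.values())
print(diff)
print(bins)
```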