terminal-bench-1.0: by examples

Results

Not solved by any model

None of the evaluated models solved these 14 examples. If the problems are sound, solving some of them is a strong signal that your model outperforms the leading models.
build-initramfs-qemu, extract-moves-from-video, gpt2-codegolf, intrusion-detection, play-zork, polyglot-c-py, qemu-alpine-ssh, raman-fitting, raman-fitting.easy, reshard-c4-data, run-pdp11-code, super-benchmark-upet, train-fasttext, write-compressor

Problems solved by 1 model only

example_link	model	min_pass1_of_model
build-linux-kernel-qemu	20251016_Chaterm_claude-4-5-sonnet	0.637
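
As a rough illustration of how the two lists above could be derived, the sketch below groups problems by how many models solved them. The function name, the input layout (a mapping from model name to the set of problems it solved), and the task and model names are all hypothetical; the page does not say how its lists were actually computed.

```python
def group_by_solver_count(all_problems, solved_by_model):
    """Return (unsolved problems, {problem: sole solver}).

    `solved_by_model` maps a model name to the set of problems it solved.
    This layout is an assumption made for illustration only.
    """
    solvers = {problem: [] for problem in all_problems}
    for model, solved in sorted(solved_by_model.items()):
        for problem in solved:
            solvers.setdefault(problem, []).append(model)

    # Problems with no solver at all.
    unsolved = sorted(p for p, models in solvers.items() if not models)
    # Problems with exactly one solver, mapped to that model.
    single = {p: models[0] for p, models in solvers.items() if len(models) == 1}
    return unsolved, single


# Hypothetical data, not taken from the leaderboard.
all_problems = ["task-a", "task-b", "task-c"]
solved_by_model = {
    "model_x": {"task-a"},
    "model_y": {"task-a", "task-b"},
}
unsolved, single = group_by_solver_count(all_problems, solved_by_model)
```

Here `unsolved` holds the problems no model solved and `single` maps each problem with exactly one solver to that model.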

Suspect problems

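These are the 10 problems whose results correlate least with the overall evaluation. A negative tau means that better models tend to do worse on the problem; a small positive tau means the problem barely tracks overall model quality.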

example_link	pass1_of_ex	tau
download-youtube	0.064	-0.248
git-multibranch	0.048	-0.161
path-tracing-reverse	0.040	-0.076
configure-git-webserver	0.384	-0.015
polyglot-rust-c	0.016	0.163
cron-broken-network	0.176	0.168
jupyter-notebook-server	0.176	0.187
hello-world	0.928	0.188
simple-sheets-put	0.824	0.238
sanitize-git-repo	0.328	0.239
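
The tau column is not defined on this page. Assuming it is a Kendall rank correlation between each model's overall score and its score on the problem, a minimal sketch of the computation might look like the following. The tau-a variant, the function name, and the sample numbers are assumptions for illustration; the report may use a different variant or tie handling.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical data: overall pass@1 of four models and their pass@1 on one problem.
overall = [0.20, 0.35, 0.50, 0.60]
on_problem = [0.30, 0.10, 0.10, 0.00]
tau = kendall_tau(overall, on_problem)  # negative: stronger models do worse here
```

Under this reading, a problem ends up near the top of the table when models that rank higher overall tend to rank lower on that problem.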

Histogram of accuracies

Histogram of problems by the accuracy on each problem.

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.
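
One possible reading of "minimum win rate to solve" is the lowest overall score among the models that solved a problem, which would match the `min_pass1_of_model` column above. The sketch below computes that value per problem and bins it into a histogram. This interpretation, the helper names, and the sample data are assumptions, not something the page states.

```python
def problem_difficulty(overall_score, solved_by_model):
    """Lowest overall score among the models that solved each problem.

    Problems solved by no model do not appear in the result.
    """
    difficulty = {}
    for model, solved in solved_by_model.items():
        score = overall_score[model]
        for problem in solved:
            difficulty[problem] = min(score, difficulty.get(problem, score))
    return difficulty


def histogram(values, n_bins=10):
    """Count values in [0, 1] per equal-width bin; 1.0 falls in the last bin."""
    counts = [0] * n_bins
    for value in values:
        index = min(int(value * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical data, not taken from the leaderboard.
overall_score = {"model_x": 0.25, "model_y": 0.55}
solved_by_model = {
    "model_x": {"task-a"},
    "model_y": {"task-a", "task-b"},
}
difficulty = problem_difficulty(overall_score, solved_by_model)
bins = histogram(difficulty.values(), n_bins=4)
```

The same `histogram` helper could be applied to per-problem accuracies to reproduce the shape of the accuracy histogram.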