terminal-bench-2.0: by examples



Not solved by any model

There are 8 examples not solved by any model. If these are good problems, solving some of them is a strong signal that your model genuinely outperforms the current leading models. A sketch for finding such examples from raw run results follows the list below.
dna-insert, install-windows-3.11, make-doom-for-mips, make-mips-interpreter, regex-chess, sam-cell-seg, torch-pipeline-parallelism, torch-tensor-parallelism
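
As a rough illustration, unsolved examples can be recovered from a long-format results table as those with a pooled pass rate of zero. The file name and column names below (`runs.csv`, `model`, `example`, `passed`) are assumptions for illustration, not the benchmark's actual schema.

```python
import pandas as pd

# Hypothetical long-format results: one row per (model, example, trial),
# with a boolean `passed` column. Schema is an assumption for illustration.
results = pd.read_csv("runs.csv")

# Pass rate per example, pooled over every model and trial.
pass_rate = results.groupby("example")["passed"].mean()

# Examples with a pooled pass rate of zero were solved by no model.
unsolved = sorted(pass_rate[pass_rate == 0].index)
print(unsolved)
```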

Problems solved by only one model

Each row gives the single model that solved the problem; min_pass1_of_model is the lowest overall pass@1 among the models that solved it, which here is just that one model's overall pass@1. A sketch for deriving this table from raw runs follows it.

example_link model min_pass1_of_model
polyglot-c-py Ante__Gemini-3-Pro-Preview 0.647
path-tracing Ante__Gemini-3-Pro-Preview 0.647
mteb-retrieve Ante__Gemini-3-Pro-Preview 0.647
protein-assembly Ante__Gemini-3-Pro-Preview 0.647
video-processing Ante__Gemini-3-Pro-Preview 0.647
write-compressor Ante__Gemini-3-Pro-Preview 0.647
polyglot-rust-c Ante__Gemini-3-Pro-Preview 0.647
filter-js-from-html MAYA__Claude-4.5-sonnet 0.427
gpt2-codegolf MAYA__Claude-4.5-sonnet 0.427
raman-fitting MAYA__Claude-4.5-sonnet 0.427
model-extraction-relu-logits dakou__qwen3-coder-480b 0.272
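
A minimal sketch for extracting this table, under the same assumed `runs.csv` schema as above (`model`, `example`, `passed` columns are hypothetical):

```python
import pandas as pd

# Hypothetical long-format results (schema is an assumption, see above).
results = pd.read_csv("runs.csv")

# Did each model solve each example at least once?
solved = results.groupby(["example", "model"])["passed"].any()
solvers = solved[solved].reset_index()

# Keep examples with exactly one solving model.
n_solvers = solvers.groupby("example")["model"].nunique()
solo = solvers[solvers["example"].isin(n_solvers[n_solvers == 1].index)]

# Attach each solving model's overall pass@1 across the whole benchmark.
overall_pass1 = results.groupby("model")["passed"].mean()
solo = solo.assign(min_pass1_of_model=solo["model"].map(overall_pass1))
print(solo[["example", "model", "min_pass1_of_model"]])
```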

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation, i.e. better models tend to do worse on them. pass1_of_ex is the problem's pass@1 pooled over all models and runs; tau is the Kendall correlation between per-model performance on the problem and each model's overall score. A sketch of the tau computation follows the table.

example_link pass1_of_ex tau
model-extraction-relu-logits 0.086 -0.535
train-fasttext 0.200 -0.461
crack-7z-hash 0.886 -0.197
filter-js-from-html 0.143 -0.178
gpt2-codegolf 0.143 -0.178
raman-fitting 0.143 -0.178
configure-git-webserver 0.400 -0.159
build-pmars 0.629 -0.117
query-optimize 0.286 -0.056
rstan-to-pystan 0.429 -0.053
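
A sketch of how the tau column could be computed, assuming a hypothetical per-(model, example) pass@1 table (`pass1.csv` with `model`, `example`, `pass1` columns) and taking the overall evaluation to be each model's mean pass@1:

```python
import pandas as pd
from scipy.stats import kendalltau

# Hypothetical per-(model, example) pass@1 table; schema is an assumption.
per_pair = pd.read_csv("pass1.csv")  # columns: model, example, pass1

# Overall score per model: mean pass@1 across all examples.
overall = per_pair.groupby("model")["pass1"].mean()

def tau_for(example):
    """Kendall tau between per-model pass@1 on this example and overall scores."""
    on_example = per_pair[per_pair["example"] == example].set_index("model")["pass1"]
    tau, _ = kendalltau(on_example, overall.loc[on_example.index])
    return tau

taus = pd.Series({ex: tau_for(ex) for ex in per_pair["example"].unique()})
print(taus.sort_values().head(10))  # most "suspect" problems: most negative tau
```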

Histogram of accuracies

Histogram of problems, binned by the accuracy (pass@1) achieved on each problem.
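
A plotting sketch under the same assumed `runs.csv` schema as above:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Per-problem accuracy pooled over all models and trials (assumed schema).
pass_rate = pd.read_csv("runs.csv").groupby("example")["passed"].mean()

plt.hist(pass_rate, bins=20, range=(0, 1))
plt.xlabel("pass@1 on the problem")
plt.ylabel("number of problems")
plt.title("Per-problem accuracies")
plt.show()
```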

Histogram of difficulties

Histogram of problems, binned by the minimum win rate needed to solve each problem (a difficulty proxy: higher means only stronger models solve it).
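
A sketch of the difficulty histogram, interpreting "minimum win rate to solve" as the lowest overall win rate among the models that solved a problem; the per-model win rates are read here from a hypothetical `models.csv`:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical inputs (schemas are assumptions):
# runs.csv   - one row per (model, example, trial) with a boolean `passed`
# models.csv - one row per model with an overall `win_rate` column
runs = pd.read_csv("runs.csv")
win_rate = pd.read_csv("models.csv").set_index("model")["win_rate"]

# For each problem, the lowest overall win rate among models that solved it:
# higher values mean only stronger models solve the problem.
solved = runs[runs["passed"]]
min_win = solved.groupby("example")["model"].agg(lambda m: win_rate.loc[m].min())

plt.hist(min_win, bins=20, range=(0, 1))
plt.xlabel("minimum win rate to solve")
plt.ylabel("number of problems")
plt.title("Per-problem difficulties")
plt.show()
```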