terminal-bench-2.0: by examples



Not solved by any model

There are 8 examples not solved by any model. If these are good problems, solving some of them is a strong signal that your model genuinely outperforms the current leading models. A sketch for finding such examples from raw run results follows the list below.
dna-insert, install-windows-3.11, make-doom-for-mips, make-mips-interpreter, regex-chess, sam-cell-seg, torch-pipeline-parallelism, torch-tensor-parallelism
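
As a rough illustration, unsolved examples can be recovered from a long-format results table as those with a pooled pass rate of zero. The file name and column names below (`runs.csv`, `model`, `example`, `passed`) are assumptions for illustration, not the benchmark's actual schema.

```python
import pandas as pd

# Hypothetical long-format results: one row per (model, example, trial),
# with a boolean `passed` column. Schema is an assumption for illustration.
results = pd.read_csv("runs.csv")

# Pass rate per example, pooled over every model and trial.
pass_rate = results.groupby("example")["passed"].mean()

# Examples with a pooled pass rate of zero were solved by no model.
unsolved = sorted(pass_rate[pass_rate == 0].index)
print(unsolved)
```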

Problems solved by only one model

Each row gives the single model that solved the problem; min_pass1_of_model is the lowest overall pass@1 among the models that solved it, which here is just that one model's overall pass@1. A sketch for deriving this table from raw runs follows it.

example_link model min_pass1_of_model
polyglot-c-py Ante__Gemini-3-Pro-Preview 0.647
path-tracing Ante__Gemini-3-Pro-Preview 0.647
mteb-retrieve Ante__Gemini-3-Pro-Preview 0.647
protein-assembly Ante__Gemini-3-Pro-Preview 0.647
video-processing Ante__Gemini-3-Pro-Preview 0.647
write-compressor Ante__Gemini-3-Pro-Preview 0.647
polyglot-rust-c Ante__Gemini-3-Pro-Preview 0.647
filter-js-from-html MAYA__Claude-4.5-sonnet 0.427
gpt2-codegolf MAYA__Claude-4.5-sonnet 0.427
raman-fitting MAYA__Claude-4.5-sonnet 0.427
model-extraction-relu-logits dakou__qwen3-coder-480b 0.272
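
A minimal sketch for extracting this table, under the same assumed `runs.csv` schema as above (`model`, `example`, `passed` columns are hypothetical):

```python
import pandas as pd

# Hypothetical long-format results (schema is an assumption, see above).
results = pd.read_csv("runs.csv")

# Did each model solve each example at least once?
solved = results.groupby(["example", "model"])["passed"].any()
solvers = solved[solved].reset_index()

# Keep examples with exactly one solving model.
n_solvers = solvers.groupby("example")["model"].nunique()
solo = solvers[solvers["example"].isin(n_solvers[n_solvers == 1].index)]

# Attach each solving model's overall pass@1 across the whole benchmark.
overall_pass1 = results.groupby("model")["passed"].mean()
solo = solo.assign(min_pass1_of_model=solo["model"].map(overall_pass1))
print(solo[["example", "model", "min_pass1_of_model"]])
```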

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation, i.e. better models tend to do worse on them. pass1_of_ex is the problem's pass@1 pooled over all models and runs; tau is the Kendall correlation between per-model performance on the problem and each model's overall score. A sketch of the tau computation follows the table.

example_link pass1_of_ex tau
model-extraction-relu-logits 0.086 -0.535
train-fasttext 0.200 -0.461
crack-7z-hash 0.886 -0.197
filter-js-from-html 0.143 -0.178
gpt2-codegolf 0.143 -0.178
raman-fitting 0.143 -0.178
configure-git-webserver 0.400 -0.159
build-pmars 0.629 -0.117
query-optimize 0.286 -0.056
rstan-to-pystan 0.429 -0.053
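
A sketch of how the tau column could be computed, assuming a hypothetical per-(model, example) pass@1 table (`pass1.csv` with `model`, `example`, `pass1` columns) and taking the overall evaluation to be each model's mean pass@1:

```python
import pandas as pd
from scipy.stats import kendalltau

# Hypothetical per-(model, example) pass@1 table; schema is an assumption.
per_pair = pd.read_csv("pass1.csv")  # columns: model, example, pass1

# Overall score per model: mean pass@1 across all examples.
overall = per_pair.groupby("model")["pass1"].mean()

def tau_for(example):
    """Kendall tau between per-model pass@1 on this example and overall scores."""
    on_example = per_pair[per_pair["example"] == example].set_index("model")["pass1"]
    tau, _ = kendalltau(on_example, overall.loc[on_example.index])
    return tau

taus = pd.Series({ex: tau_for(ex) for ex in per_pair["example"].unique()})
print(taus.sort_values().head(10))  # most "suspect" problems: most negative tau
```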

Histogram of accuracies

Histogram of problems, binned by the accuracy (pass@1) achieved on each problem.
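
A plotting sketch under the same assumed `runs.csv` schema as above:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Per-problem accuracy pooled over all models and trials (assumed schema).
pass_rate = pd.read_csv("runs.csv").groupby("example")["passed"].mean()

plt.hist(pass_rate, bins=20, range=(0, 1))
plt.xlabel("pass@1 on the problem")
plt.ylabel("number of problems")
plt.title("Per-problem accuracies")
plt.show()
```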

Histogram of difficulties

Histogram of problems, binned by the minimum win rate needed to solve each problem (a difficulty proxy: higher means only stronger models solve it).
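
A sketch of the difficulty histogram, interpreting "minimum win rate to solve" as the lowest overall win rate among the models that solved a problem; the per-model win rates are read here from a hypothetical `models.csv`:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical inputs (schemas are assumptions):
# runs.csv   - one row per (model, example, trial) with a boolean `passed`
# models.csv - one row per model with an overall `win_rate` column
runs = pd.read_csv("runs.csv")
win_rate = pd.read_csv("models.csv").set_index("model")["win_rate"]

# For each problem, the lowest overall win rate among models that solved it:
# higher values mean only stronger models solve the problem.
solved = runs[runs["passed"]]
min_win = solved.groupby("example")["model"].agg(lambda m: win_rate.loc[m].min())

plt.hist(min_win, bins=20, range=(0, 1))
plt.xlabel("minimum win rate to solve")
plt.ylabel("number of problems")
plt.title("Per-problem difficulties")
plt.show()
```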