terminal-bench-2.0: by models


SE predicted by accuracy

The typical standard error between pairs of models on this dataset, shown as a function of absolute accuracy.

CDF of question-level accuracy

Results table by model

model                        pass1  pass@count  win_rate  count  SE(A)  SE_x(A)  SE_pred(A)
Ante__Gemini-3-Pro-Preview    64.7        82        26.9      5    5.1      4         3.1
Mux__GPT-5.2                  60.7        60.7      23.1      1    5.2    NaN         NaN
Mux__Claude-Opus-4.5          58.4        58.4      21        1    5.2    NaN         NaN
OpenCode__Claude-Opus-4.5     51.7        51.7      17.3      1    5.3    NaN         NaN
MAYA__Claude-4.5-sonnet       42.7        42.7      18.7      5    5.2      5.2         0
Terminus2__GLM-4.7            33          50.6       8.85     5    5        3.9         3.1
dakou__qwen3-coder-480b       27.2        42.7       7.53     5    4.7      3.7         2.9
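
The SE(A) column can be sanity-checked against the binomial standard error sqrt(p(1-p)/n). The sketch below is only a consistency check under stated assumptions: it treats pass1 as a percentage and assumes n = 89 tasks, a number that does not appear on this page. Under those assumptions the formula matches SE(A) to one decimal place for every row. For rows where all three columns are present, the rounded values also look consistent with SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2; that relation is inferred from the table, not a definition given here. The names `N_TASKS`, `binomial_se`, and `quadrature_sum` are hypothetical helpers.

```python
import math

# Assumption: number of tasks in the benchmark (not stated on this page).
N_TASKS = 89

# Rows copied from the results table:
# (model, pass1 %, SE(A), SE_x(A), SE_pred(A)); None stands for NaN.
ROWS = [
    ("Ante__Gemini-3-Pro-Preview", 64.7, 5.1, 4.0, 3.1),
    ("Mux__GPT-5.2", 60.7, 5.2, None, None),
    ("Mux__Claude-Opus-4.5", 58.4, 5.2, None, None),
    ("OpenCode__Claude-Opus-4.5", 51.7, 5.3, None, None),
    ("MAYA__Claude-4.5-sonnet", 42.7, 5.2, 5.2, 0.0),
    ("Terminus2__GLM-4.7", 33.0, 5.0, 3.9, 3.1),
    ("dakou__qwen3-coder-480b", 27.2, 4.7, 3.7, 2.9),
]


def binomial_se(acc_pct: float, n: int) -> float:
    """Binomial standard error of an accuracy, in percentage points."""
    p = acc_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


def quadrature_sum(a: float, b: float) -> float:
    """sqrt(a^2 + b^2): how independent error components combine."""
    return math.hypot(a, b)


if __name__ == "__main__":
    for model, pass1, se, se_x, se_pred in ROWS:
        line = f"{model:28s} table SE(A)={se:.1f}  binomial={binomial_se(pass1, N_TASKS):.2f}"
        if se_x is not None and se_pred is not None:
            line += f"  quadrature={quadrature_sum(se_x, se_pred):.2f}"
        print(line)
```

Because the table values are rounded to one decimal, small differences between the quadrature sum and SE(A) are expected.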