terminal-bench-1.0: by models


SE predicted by accuracy

This plot shows the typical standard errors between pairs of models on this dataset as a function of absolute accuracy.

CDF of question level accuracy

Results table by model

model pass1 pass@count win_rate count SE(A) SE_x(A) SE_pred(A)
20251019_apex_agent_claude-4-5-sonnet 64.2 67.5 21.7 5 5.4 5.1 1.7
20251016_Chaterm_claude-4-5-sonnet 63.7 67.5 22.3 5 5.4 5.1 1.7
20251108_abacusai-desktop_multiple 62.2 71.2 21.9 5 5.4 4.8 2.6
20251013_ante_claude_4-5_sonnet 60.2 70 19.7 5 5.5 4.9 2.4
20250923_droid_claude-4-1-opus 58.8 68.8 19.2 5 5.5 4.8 2.7
20251001_droid_claude-4-5-sonnet 57.5 67.5 17.7 5 5.5 4.9 2.6
ob1-09-10-25 56.8 62.5 20.2 5 5.5 5.1 2.1
20250926_ante_claude-4-sonnet 54.8 67.5 16.8 5 5.6 4.8 2.8
20250924_droid_gpt-5 52.5 66.2 15.5 5 5.6 4.6 3.2
20251010_Chaterm_claude-4-5-sonnet 52.5 66.2 15.1 5 5.6 4.7 3
20251017_deepagent-desktop_claude-4-5-sonnet 50.5 58.8 13.9 5 5.6 5.1 2.3
20250923_droid_claude-4-sonnet 50.5 65 14.4 5 5.6 4.8 2.9
20251016_apex_agent_gpt-5 49.2 66.2 14.9 5 5.6 4.5 3.3
20250911_chaterm_claude-4-sonnet 49.2 63.7 13.7 5 5.6 4.7 3
20250829_goose_claude-4-opus 45.2 56.2 11.5 5 5.6 4.9 2.6
20251111_iflow-cli_Minimax-M2 42 60 11 5 5.5 4.4 3.3
20250711_openhands_claude-4-sonnet 41.2 53.8 10.5 5 5.5 4.9 2.6
20250829_goose_claude-4-sonnet 41.2 51.2 10.1 5 5.5 4.9 2.5
20251026_iflow-cli_Qwen3-Coder-480A30 39 56.2 9.98 5 5.5 4.3 3.3
20251012_alpha_claude-4-5-sonnet 38.2 56.2 9.42 5 5.4 4.4 3.1
20250902_orchestrator_claude-4-sonnet 36 57.5 8.76 5 5.4 4 3.6
20251007_camel-agent_gpt-4-1 35 51.2 8.3 5 5.3 4.4 3
20250811_cursor-cli_claude-4-sonnet 26.2 40 5.7 5 4.9 4 2.9
20250902_orchestrator_qwen-3-coder-480B 19.2 38.8 4.15 5 4.4 3.1 3.2
20250825_swe-agent-mini_claude-4-sonnet 12.8 22.5 3.35 5 3.7 2.7 2.6
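
The page does not say how the SE(A) column is computed. As a rough sketch, the reported values are consistent with the usual binomial standard error, sqrt(p * (1 - p) / N), if the benchmark is assumed to contain N = 80 questions. Both the formula and N = 80 are assumptions inferred from the numbers above, not something stated here, and the helper name `binomial_se` is hypothetical. The rows in the snippet are copied from the table.

```python
import math

# Assumption: the number of questions is not given on this page.
# N = 80 is inferred because it reproduces the SE(A) column.
N_QUESTIONS = 80


def binomial_se(pass1_percent, n_questions=N_QUESTIONS):
    """Binomial standard error of an accuracy, in percentage points.

    Hypothetical helper: it assumes each question is an independent
    pass/fail trial with success probability equal to pass1.
    """
    p = pass1_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


# (model, pass1, reported SE(A)) copied from the results table.
rows = [
    ("20251019_apex_agent_claude-4-5-sonnet", 64.2, 5.4),
    ("20251017_deepagent-desktop_claude-4-5-sonnet", 50.5, 5.6),
    ("20250811_cursor-cli_claude-4-sonnet", 26.2, 4.9),
    ("20250825_swe-agent-mini_claude-4-sonnet", 12.8, 3.7),
]

for model, pass1, reported in rows:
    estimate = round(binomial_se(pass1), 1)
    print(f"{model}: estimated {estimate}, reported {reported}")
```

Under this assumption the standard error is largest near 50% accuracy and shrinks toward the extremes, which matches the drop from about 5.6 in the middle of the table to 3.7 for the lowest-scoring entry.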