terminal-bench-1.0: by models

SE predicted by accuracy

Typical standard errors for comparisons between pairs of models on this dataset, shown as a function of absolute accuracy.

model	pass1	pass@count	win_rate	count	SE(A)	SE_x(A)	SE_pred(A)
20251019_apex_agent_claude-4-5-sonnet	64.2	67.5	21.7	5	5.4	5.1	1.7
20251016_Chaterm_claude-4-5-sonnet	63.7	67.5	22.3	5	5.4	5.1	1.7
20251108_abacusai-desktop_multiple	62.2	71.2	21.9	5	5.4	4.8	2.6
20251013_ante_claude_4-5_sonnet	60.2	70	19.7	5	5.5	4.9	2.4
20250923_droid_claude-4-1-opus	58.8	68.8	19.2	5	5.5	4.8	2.7
20251001_droid_claude-4-5-sonnet	57.5	67.5	17.7	5	5.5	4.9	2.6
ob1-09-10-25	56.8	62.5	20.2	5	5.5	5.1	2.1
20250926_ante_claude-4-sonnet	54.8	67.5	16.8	5	5.6	4.8	2.8
20250924_droid_gpt-5	52.5	66.2	15.5	5	5.6	4.6	3.2
20251010_Chaterm_claude-4-5-sonnet	52.5	66.2	15.1	5	5.6	4.7	3
20251017_deepagent-desktop_claude-4-5-sonnet	50.5	58.8	13.9	5	5.6	5.1	2.3
20250923_droid_claude-4-sonnet	50.5	65	14.4	5	5.6	4.8	2.9
20251016_apex_agent_gpt-5	49.2	66.2	14.9	5	5.6	4.5	3.3
20250911_chaterm_claude-4-sonnet	49.2	63.7	13.7	5	5.6	4.7	3
20250829_goose_claude-4-opus	45.2	56.2	11.5	5	5.6	4.9	2.6
20251111_iflow-cli_Minimax-M2	42	60	11	5	5.5	4.4	3.3
20250711_openhands_claude-4-sonnet	41.2	53.8	10.5	5	5.5	4.9	2.6
20250829_goose_claude-4-sonnet	41.2	51.2	10.1	5	5.5	4.9	2.5
20251026_iflow-cli_Qwen3-Coder-480A30	39	56.2	9.98	5	5.5	4.3	3.3
20251012_alpha_claude-4-5-sonnet	38.2	56.2	9.42	5	5.4	4.4	3.1
20250902_orchestrator_claude-4-sonnet	36	57.5	8.76	5	5.4	4	3.6
20251007_camel-agent_gpt-4-1	35	51.2	8.3	5	5.3	4.4	3
20250811_cursor-cli_claude-4-sonnet	26.2	40	5.7	5	4.9	4	2.9
20250902_orchestrator_qwen-3-coder-480B	19.2	38.8	4.15	5	4.4	3.1	3.2
20250825_swe-agent-mini_claude-4-sonnet	12.8	22.5	3.35	5	3.7	2.7	2.6
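
The SE columns above can be sanity-checked with a short sketch. It rests on two assumptions that the page does not state and that are only inferred from the numbers: (1) SE(A) behaves like a binomial standard error of pass1 over 80 tasks (the pass@count values are all multiples of 1.25 points, which suggests 80 tasks), and (2) SE_x(A) and SE_pred(A) combine in quadrature to give SE(A). The helper names `binomial_se` and `combine_in_quadrature` are hypothetical, chosen for illustration.

```python
import math

# Assumption: 80 tasks, inferred from pass@count being multiples of 1.25 points.
N_TASKS = 80


def binomial_se(acc_pct, n_tasks=N_TASKS):
    """Binomial standard error of an accuracy, in percentage points."""
    p = acc_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_tasks)


def combine_in_quadrature(se_x, se_pred):
    """Assumed relation: SE(A)^2 = SE_x(A)^2 + SE_pred(A)^2."""
    return math.hypot(se_x, se_pred)


# (model, pass1, SE(A), SE_x(A), SE_pred(A)) copied from three rows of the table.
rows = [
    ("20251019_apex_agent_claude-4-5-sonnet", 64.2, 5.4, 5.1, 1.7),
    ("20250923_droid_claude-4-sonnet", 50.5, 5.6, 4.8, 2.9),
    ("20250825_swe-agent-mini_claude-4-sonnet", 12.8, 3.7, 2.7, 2.6),
]

for name, pass1, se, se_x, se_pred in rows:
    print(
        f"{name}: table SE={se:.1f}, "
        f"binomial={binomial_se(pass1):.2f}, "
        f"quadrature={combine_in_quadrature(se_x, se_pred):.2f}"
    )
```

For these three rows both estimates land within 0.1 points of the tabulated SE(A), which is consistent with the assumed relations but does not confirm how the columns were actually computed.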