Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.

| model | pass1 | pass@count | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|---|
| 20251019_apex_agent_claude-4-5-sonnet | 64.2 | 67.5 | 21.7 | 5 | 5.4 | 5.1 | 1.7 |
| 20251016_Chaterm_claude-4-5-sonnet | 63.7 | 67.5 | 22.3 | 5 | 5.4 | 5.1 | 1.7 |
| 20251108_abacusai-desktop_multiple | 62.2 | 71.2 | 21.9 | 5 | 5.4 | 4.8 | 2.6 |
| 20251013_ante_claude_4-5_sonnet | 60.2 | 70 | 19.7 | 5 | 5.5 | 4.9 | 2.4 |
| 20250923_droid_claude-4-1-opus | 58.8 | 68.8 | 19.2 | 5 | 5.5 | 4.8 | 2.7 |
| 20251001_droid_claude-4-5-sonnet | 57.5 | 67.5 | 17.7 | 5 | 5.5 | 4.9 | 2.6 |
| ob1-09-10-25 | 56.8 | 62.5 | 20.2 | 5 | 5.5 | 5.1 | 2.1 |
| 20250926_ante_claude-4-sonnet | 54.8 | 67.5 | 16.8 | 5 | 5.6 | 4.8 | 2.8 |
| 20250924_droid_gpt-5 | 52.5 | 66.2 | 15.5 | 5 | 5.6 | 4.6 | 3.2 |
| 20251010_Chaterm_claude-4-5-sonnet | 52.5 | 66.2 | 15.1 | 5 | 5.6 | 4.7 | 3 |
| 20251017_deepagent-desktop_claude-4-5-sonnet | 50.5 | 58.8 | 13.9 | 5 | 5.6 | 5.1 | 2.3 |
| 20250923_droid_claude-4-sonnet | 50.5 | 65 | 14.4 | 5 | 5.6 | 4.8 | 2.9 |
| 20251016_apex_agent_gpt-5 | 49.2 | 66.2 | 14.9 | 5 | 5.6 | 4.5 | 3.3 |
| 20250911_chaterm_claude-4-sonnet | 49.2 | 63.7 | 13.7 | 5 | 5.6 | 4.7 | 3 |
| 20250829_goose_claude-4-opus | 45.2 | 56.2 | 11.5 | 5 | 5.6 | 4.9 | 2.6 |
| 20251111_iflow-cli_Minimax-M2 | 42 | 60 | 11 | 5 | 5.5 | 4.4 | 3.3 |
| 20250711_openhands_claude-4-sonnet | 41.2 | 53.8 | 10.5 | 5 | 5.5 | 4.9 | 2.6 |
| 20250829_goose_claude-4-sonnet | 41.2 | 51.2 | 10.1 | 5 | 5.5 | 4.9 | 2.5 |
| 20251026_iflow-cli_Qwen3-Coder-480A30 | 39 | 56.2 | 9.98 | 5 | 5.5 | 4.3 | 3.3 |
| 20251012_alpha_claude-4-5-sonnet | 38.2 | 56.2 | 9.42 | 5 | 5.4 | 4.4 | 3.1 |
| 20250902_orchestrator_claude-4-sonnet | 36 | 57.5 | 8.76 | 5 | 5.4 | 4 | 3.6 |
| 20251007_camel-agent_gpt-4-1 | 35 | 51.2 | 8.3 | 5 | 5.3 | 4.4 | 3 |
| 20250811_cursor-cli_claude-4-sonnet | 26.2 | 40 | 5.7 | 5 | 4.9 | 4 | 2.9 |
| 20250902_orchestrator_qwen-3-coder-480B | 19.2 | 38.8 | 4.15 | 5 | 4.4 | 3.1 | 3.2 |
| 20250825_swe-agent-mini_claude-4-sonnet | 12.8 | 22.5 | 3.35 | 5 | 3.7 | 2.7 | 2.6 |
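
The three SE columns are not defined in this section. One plausible reading, which is an assumption and not something the table states, is that `SE(A)` is the total standard error of a model's accuracy, `SE_x(A)` is the part due to the choice of questions, and `SE_pred(A)` is the part due to run-to-run variation in predictions, with the two parts combining in quadrature. The minimal sketch below checks that reading against four rows copied from the table; the helper name `combined_se` is hypothetical.

```python
import math

# (pass1, SE(A), SE_x(A), SE_pred(A)) in percentage points,
# copied from four rows of the table above.
rows = {
    "20251019_apex_agent_claude-4-5-sonnet": (64.2, 5.4, 5.1, 1.7),
    "20250924_droid_gpt-5": (52.5, 5.6, 4.6, 3.2),
    "20250902_orchestrator_claude-4-sonnet": (36.0, 5.4, 4.0, 3.6),
    "20250825_swe-agent-mini_claude-4-sonnet": (12.8, 3.7, 2.7, 2.6),
}


def combined_se(se_x: float, se_pred: float) -> float:
    """Combine two independent error components in quadrature.

    Assumption: SE(A)^2 = SE_x(A)^2 + SE_pred(A)^2. This is an
    interpretation of the column names, not a documented definition.
    """
    return math.hypot(se_x, se_pred)


for name, (pass1, se_total, se_x, se_pred) in rows.items():
    reconstructed = combined_se(se_x, se_pred)
    print(f"{name}: reported SE(A)={se_total:.1f}, "
          f"sqrt(SE_x^2 + SE_pred^2)={reconstructed:.2f}")
```

For these four rows the reconstructed value agrees with the reported `SE(A)` to within the one-decimal rounding of the table, which is consistent with the quadrature reading but does not prove it.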