Typical standard errors for comparisons between pairs of models on this dataset, as a function of absolute accuracy.
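
To make the dependence on accuracy concrete, here is a minimal sketch of the single-model standard error. It rests on two assumptions that the source does not state: that `SE(A)` is the binomial standard error of the pass rate, and that the benchmark has 500 independent problems. The names `binomial_se` and `N_PROBLEMS` are hypothetical and chosen only for illustration.

```python
import math

# Assumption: number of benchmark problems; not stated in the source.
N_PROBLEMS = 500


def binomial_se(accuracy_pct: float, n_problems: int) -> float:
    """Binomial standard error, in percentage points, of an accuracy
    estimated from n_problems independent pass/fail outcomes."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_problems)


# Accuracies taken from the pass1 column of the table.
for acc in (78.8, 77.4, 74.4):
    print(f"pass1={acc:.1f}  SE={binomial_se(acc, N_PROBLEMS):.1f}")
```

Under these assumptions the printed values are 1.8, 1.9, and 2.0, which appears to agree with the `SE(A)` column for the corresponding rows. The error peaks at 50% accuracy and shrinks toward 0% or 100%, so it varies only slightly across the accuracies listed here.
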
| model | pass1 | pass@count | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|---|
| 20250928_trae_doubao_seed_code | 78.8 | 78.8 | 30.6 | 1 | 1.8 | NaN | NaN |
| 20251120_livesweagent_gemini-3-pro-preview | 77.4 | 77.4 | 29.3 | 1 | 1.9 | NaN | NaN |
| 20250804_epam-ai-run-claude-4-sonnet | 76.8 | 76.8 | 28.6 | 1 | 1.9 | NaN | NaN |
| 20250902_atlassian-rovo-dev | 76.8 | 76.8 | 28.5 | 1 | 1.9 | NaN | NaN |
| 20250819_ACoder | 76.4 | 76.4 | 28.2 | 1 | 1.9 | NaN | NaN |
| 20250901_warp | 75.6 | 75.6 | 28.1 | 1 | 1.9 | NaN | NaN |
| 20250612_trae | 75.2 | 75.2 | 27.2 | 1 | 1.9 | NaN | NaN |
| 20251103_sonar-foundation-agent_claude-sonnet-4-5 | 74.8 | 74.8 | 27.4 | 1 | 1.9 | NaN | NaN |
| 20250731_harness_ai | 74.8 | 74.8 | 26.2 | 1 | 1.9 | NaN | NaN |
| 20250915_JoyCode | 74.6 | 74.6 | 27.3 | 1 | 1.9 | NaN | NaN |
| 20250720_Lingxi-v1.5_claude-4-sonnet-20250514 | 74.6 | 74.6 | 26.4 | 1 | 1.9 | NaN | NaN |
| 20251015_Prometheus_v1.2.1_gpt5 | 74.4 | 74.4 | 27.5 | 1 | 2 | NaN | NaN |
| 20250603_Refact_Agent_claude-4-sonnet | 74.4 | 74.4 | 26.4 | 1 | 2 | NaN | NaN |
| 20251103_SalesforceAIResearch_SAGE_OpenHands | 73.8 | 73.8 | 26.6 | 1 | 2 | NaN | NaN |
| 20250522_tools_claude-4-opus | 73.2 | 73.2 | 26.6 | 1 | 2 | NaN | NaN |
| 20251021_SalesforceAIResearch_SAGE_bash_only | 73 | 73 | 26.3 | 1 | 2 | NaN | NaN |
| 20250522_tools_claude-4-sonnet | 72.4 | 72.4 | 25.5 | 1 | 2 | NaN | NaN |
| 20250807_openhands_gpt5 | 71.8 | 71.8 | 25.1 | 1 | 2 | NaN | NaN |
| 20250715_qodo_command | 71.2 | 71.2 | 24.5 | 1 | 2 | NaN | NaN |
| 20250929_Prometheus_v1.2_gpt5 | 71.2 | 71.2 | 25.2 | 1 | 2 | NaN | NaN |
| 20251014_Lingxi_kimi_k2 | 71.2 | 71.2 | 24.3 | 1 | 2 | NaN | NaN |
| 20250710_bloop | 71.2 | 71.2 | 24.3 | 1 | 2 | NaN | NaN |
| 20250623_warp | 71 | 71 | 24.4 | 1 | 2 | NaN | NaN |
| 20250611_moatless_claude-4-sonnet-20250514 | 70.8 | 70.8 | 23.7 |