swebench-verified: by models


SE predicted by accuracy

Typical standard errors for comparisons between pairs of models on this dataset, shown as a function of absolute accuracy.
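As a rough illustration of how a standard error depends on accuracy, the sketch below computes the binomial standard error of a single model's accuracy. It assumes 500 questions (the size of SWE-bench Verified) and one run per model. The helper name `binomial_se` is hypothetical, and the page's own computation may differ.

```python
import math


def binomial_se(accuracy_pct: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy, in percentage points.

    Assumes each question is an independent pass/fail trial, which is a
    simplification and not necessarily how this page computes SE(A).
    """
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


# Assumption: SWE-bench Verified contains 500 questions.
N_QUESTIONS = 500

for acc in (78.8, 76.8, 74.4):
    se = binomial_se(acc, N_QUESTIONS)
    print(f"accuracy {acc:.1f}% -> SE of about {se:.1f} points")
```

Under these assumptions the three values round to 1.8, 1.9, and 2.0 points, which agrees with the SE(A) entries listed for those accuracies in the results table. The error is largest near 50% accuracy and shrinks toward 0% and 100%.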

CDF of question level accuracy

Results table by model

model pass1 pass@count win_rate count SE(A) SE_x(A) SE_pred(A)
20250928_trae_doubao_seed_code 78.8 78.8 30.6 1 1.8 NaN NaN
20251120_livesweagent_gemini-3-pro-preview 77.4 77.4 29.3 1 1.9 NaN NaN
20250804_epam-ai-run-claude-4-sonnet 76.8 76.8 28.6 1 1.9 NaN NaN
20250902_atlassian-rovo-dev 76.8 76.8 28.5 1 1.9 NaN NaN
20250819_ACoder 76.4 76.4 28.2 1 1.9 NaN NaN
20250901_warp 75.6 75.6 28.1 1 1.9 NaN NaN
20250612_trae 75.2 75.2 27.2 1 1.9 NaN NaN
20251103_sonar-foundation-agent_claude-sonnet-4-5 74.8 74.8 27.4 1 1.9 NaN NaN
20250731_harness_ai 74.8 74.8 26.2 1 1.9 NaN NaN
20250915_JoyCode 74.6 74.6 27.3 1 1.9 NaN NaN
20250720_Lingxi-v1.5_claude-4-sonnet-20250514 74.6 74.6 26.4 1 1.9 NaN NaN
20251015_Prometheus_v1.2.1_gpt5 74.4 74.4 27.5 1 2 NaN NaN
20250603_Refact_Agent_claude-4-sonnet 74.4 74.4 26.4 1 2 NaN NaN
20251103_SalesforceAIResearch_SAGE_OpenHands 73.8 73.8 26.6 1 2 NaN NaN
20250522_tools_claude-4-opus 73.2 73.2 26.6 1 2 NaN NaN
20251021_SalesforceAIResearch_SAGE_bash_only 73 73 26.3 1 2 NaN NaN
20250522_tools_claude-4-sonnet 72.4 72.4 25.5 1 2 NaN NaN
20250807_openhands_gpt5 71.8 71.8 25.1 1 2 NaN NaN
20250715_qodo_command 71.2 71.2 24.5 1 2 NaN NaN
20250929_Prometheus_v1.2_gpt5 71.2 71.2 25.2 1 2 NaN NaN
20251014_Lingxi_kimi_k2 71.2 71.2 24.3 1 2 NaN NaN
20250710_bloop 71.2 71.2 24.3 1 2 NaN NaN
20250623_warp 71 71 24.4 1 2 NaN NaN
20250611_moatless_claude-4-sonnet-20250514 70.8 70.8 23.7