swebench-lite: by models

SE predicted by accuracy

Typical standard errors for comparisons between pairs of models on this dataset, shown as a function of absolute accuracy.
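As a rough illustration of what a standard error between a pair of models can mean, the sketch below computes the standard error of the accuracy difference for two models graded on the same questions. The paired-difference estimator, the function name `paired_se`, and the toy outcome vectors are assumptions made for illustration; the page does not state the exact formula it uses.

```python
import math

def paired_se(a, b):
    """Standard error of mean(a) - mean(b) for 0/1 outcomes on the same questions.

    Illustrative helper (not from the page): it treats each question's
    difference a[i] - b[i] as one sample and returns the standard error
    of the mean of those differences.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long outcome lists with at least 2 items")
    n = len(a)
    diffs = [x - y for x, y in zip(a, b)]
    mean = sum(diffs) / n
    # Sample variance of the per-question differences (n - 1 in the denominator).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return math.sqrt(var / n)

# Toy per-question outcomes (1 = solved, 0 = not solved); purely hypothetical.
model_a = [1, 1, 1, 0, 0, 1, 0, 1]
model_b = [1, 0, 1, 0, 0, 1, 0, 0]

diff = sum(model_a) / len(model_a) - sum(model_b) / len(model_b)
print(f"accuracy difference = {diff:.3f}, paired SE = {paired_se(model_a, model_b):.3f}")
```

Because both models answer the same questions, only the questions on which they disagree contribute to the variance, so the paired standard error is usually smaller than one obtained by treating the two accuracies as independent.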

CDF of question-level accuracy

Results table by model

model pass1 win_rate count SE(A) SE_x(A) SE_pred(A)
20250625_ExpeRepair-v1_claude-4-sonnet-20250514 60.3 29.5 1 2.8 NaN NaN
20250425_Refact_Agent 60 30.3 1 2.8 NaN NaN
20250906_KGCompass_claude-4-sonnet-20250514 58.3 29.9 1 2.8 NaN NaN
20250526_sweagent_claude-4-sonnet-20250514 56.7 27.9 1 2.9 NaN NaN
20250114_Isoform 55 25.9 1 2.9 NaN NaN
20250625_SemAgent_Multi-v1_Claude3.7Sonnet_Gemini2.5Pro 51.7 22.9 1 2.9 NaN NaN
20250911_isea_claude-3.5-sonnet-20241022 51.3 22.3 1 2.9 NaN NaN
20250901_entroPO_R2E_QwenCoder30BA3B_tts 49.7 21.4 1 2.9 NaN NaN
20250528_Codev 49 20.2 1 2.9 NaN NaN
20241220_blackboxai_agent_v1 49 21.2 1 2.9 NaN NaN
20241208_gru 48.7 20.3 1 2.9 NaN NaN
20250613_ExpeRepair-v1.0 48.3 20.8 1 2.9 NaN NaN
20241127_globant_codefixer_agent 48.3 21.3 1 2.9 NaN NaN
20250226_sweagent_claude-3-7-sonnet-20250219 48 20 1 2.9 NaN NaN
20241122_devlo 47.3 19.7 1 2.9 NaN NaN
20250205_dars_agent_claude_3.5_sonnet_deepseek_r1 47 19.9 1 2.9 NaN NaN
20250619_KGCompass_claude-3.5-sonnet-20241022 46 21.1 1 2.9 NaN NaN
20250901_entroPO_R2E_QwenCoder30BA3B 45 18.4 1 2.9 NaN NaN
20241207_kodu_sonnet_v1 44.7 22.1 1 2.9 NaN NaN
20250310_codefuse-cgm 44 16.7 1 2.9 NaN NaN
20240702_codestory_aide_mixed 43 16.8 1 2.9 NaN NaN
20250509_Lingxi_claude-3-5-sonnet-20241022 42.7 17 1 2.9 NaN NaN
20250515_codartai 41.7 19.9 1 2.8 NaN NaN
20241025_OpenHands-CodeAct-2.1-sonnet-20241022 41.7 16.9 1 2.8 NaN NaN
20241220_PatchKitty-0.9_claude-3.5-sonnet-20241022 41.3 15.3 1 2.8 NaN NaN
20241030_composio_swekit 41 15.1 1 2.8 NaN NaN
20250113_OrcaLoca 41 15.4 1 2.8 NaN NaN
20241202_agentless-1.5_claude-3.5-sonnet-20241022 40.7 15.1 1 2.8 NaN NaN
20250113_OpenCSG-Starship-Agentic-Coder_gpt4o 39.7 14.8 1 2.8 NaN NaN
20240912_marscode-agent-dev 39.3 14.9 1 2.8 NaN NaN
20250114_moatless_claude-3.5-sonnet-20241022 39 15.1 1 2.8 NaN NaN
20241117_moatless_claude-3.5-sonnet-20241022 38.3 13.2 1 2.8 NaN NaN
20240820_honeycomb 38.3 14.2 1 2.8 NaN NaN
20240627_abanteai_mentatbot_gpt4o 38 15.2 1 2.8 NaN NaN
20250104_patched_codes_claude-3.5-sonnet-20241022 37 13.7 1 2.8 NaN NaN
20250609_KGCompass_deepseek-v3 36.7 15.3 1 2.8 NaN NaN
20241113_navie-2-gpt4o-sonnet 36 12.9 1 2.8 NaN NaN
20240811_gru 35.7 12.6 1 2.8 NaN NaN
20250104_codefuse-aais 35.7 12.3 1 2.8 NaN NaN
20240829_Isoform 35 11.8 1 2.8 NaN NaN
20240723_marscode-agent-dev 34 12.1 1 2.7 NaN NaN
20240806_SuperCoder2.0 34 13.4 1 2.7 NaN NaN
20240622_Lingma_Agent 33 10.9 1 2.7 NaN NaN
20250214_agentless_lite_o3_mini 32.3 12.5 1 2.7 NaN NaN
20241028_agentless-1.5_gpt4o 32 10.6 1 2.7 NaN NaN
20241111_codeshelltester_gpt4o 31.3 10.9 1 2.7 NaN NaN
20240617_factory_code_droid 31.3 11.2 1 2.7 NaN NaN
20250111_moatless_deepseek_v3 30.7 9.51 1 2.7 NaN NaN
20240621_autocoderover-v20240620 30.7 9.82 1 2.7 NaN NaN
20250207_aegis_o3mini 30.3 11.3 1 2.7 NaN NaN
20240908_infant_gpt4o 30 10.2 1 2.6 NaN NaN
20241203_KortixAI-AgentPress-sonnet-20241022 30 10.6 1 2.6 NaN NaN
20240808_RepoGraph_gpt4o 29.7 9.63 1 2.6 NaN NaN
20240721_amazon-q-developer-agent-20240719-dev 29.7 10 1 2.6 NaN NaN
20240604_CodeR 28.3 8.98 1 2.6 NaN NaN
20241117_reproducedRG_gpt4o 28 8.37 1 2.6 NaN NaN
20240706_sima_gpt4o 27.7 8.38 1 2.6 NaN NaN
20240612_MASAI_gpt4o 27.3 8.41 1 2.6 NaN NaN
20240630_agentless_gpt4o 27.3 8.49 1 2.6 NaN NaN
20240612_IBM_Research_Agent101 26.7 7.8 1 2.6 NaN NaN
20240725_opendevin_codeact_v1.8_claude35sonnet 26.7 8.93 1 2.6 NaN NaN
20240623_moatless_claude35sonnet 26.7 7.71 1 2.6 NaN NaN
20240523_aider 26.3 8.27 1 2.5 NaN NaN
20240925_hyperagent_lite1 25.3 8.04 1 2.5 NaN NaN
20250306_SWE-Fixer_Qwen2.5-7b-retriever_Qwen2.5-72b-editor 24.7 7.3 1 2.5 NaN NaN
20240617_moatless_gpt4o 24.7 7.17 1 2.5 NaN NaN
20240524_opencsg_starship_gpt4 23.7 6.93 1 2.5 NaN NaN
20241016_IBM-SWE-1.0 23.7 6.5 1 2.5 NaN NaN
20241128_SWE-Fixer_Qwen2.5-7b-retriever_Qwen2.5-72b-editor_20241128 23.3 6.91 1 2.4 NaN NaN
20240620_sweagent_claude3.5sonnet 23 7.52 1 2.4 NaN NaN
20240828_autose_mixed 21.7 5.67 1 2.4 NaN NaN
20240615_appmap-navie_gpt4o 21.7 6.45 1 2.4 NaN NaN
20240509_amazon-q-developer-agent-20240430-dev 20.3 6.35 1 2.3 NaN NaN
20240530_autocoderover-v20240408 19 5.31 1 2.3 NaN NaN
20240728_sweagent_gpt4o 18.3 4.86 1 2.2 NaN NaN
20240402_sweagent_gpt4 18 4.62 1 2.2 NaN NaN
20250627_agentless_MCTS-Refine-7B 16.3 6.21 1 2.1 NaN NaN
20240402_sweagent_claude3opus 11.7 2.6 1 1.9 NaN NaN
20240402_rag_claude3opus 4.33 0.59 1 1.2 NaN NaN
20231010_rag_claude2 3 0.482 1 0.98 NaN NaN
20240402_rag_gpt4 2.67 0.498 1 0.93 NaN NaN
20231010_rag_swellama7b 1.33 0.394 1 0.66 NaN NaN
20231010_rag_swellama13b 1 0.261 1 0.57 NaN NaN
20231010_rag_gpt35 0.333 0.0201 1 0.33 NaN NaN
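
The SE(A) column appears consistent with the binomial standard error of pass1 over the 300 questions of SWE-bench Lite. The sketch below checks that reading against a few rows copied from the table; the binomial formula and the question count of 300 are assumptions here, since the page does not define the column, and `binomial_se` is a hypothetical helper name.

```python
import math

N_QUESTIONS = 300  # assumption: size of the SWE-bench Lite test split

def binomial_se(pass1_percent, n=N_QUESTIONS):
    """Binomial standard error of an accuracy, in percentage points."""
    p = pass1_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# (pass1, SE(A)) pairs copied from the results table.
rows = [(60.3, 2.8), (56.7, 2.9), (30.0, 2.6), (11.7, 1.9), (3.0, 0.98), (0.333, 0.33)]

for pass1, se_table in rows:
    se = binomial_se(pass1)
    print(f"pass1={pass1:6.3f}  computed SE={se:.2f}  table SE(A)={se_table}")
```

Under this reading, SE(A) peaks near 50% accuracy and shrinks toward 0% and 100%, which matches the pattern in the table: about 2.9 for mid-range models and below 1 for the weakest ones.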