swebench-test: by models



SE predicted by accuracy

The typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
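As a rough sketch of how such a standard error can be predicted from accuracy alone: if each model's pass@1 is an average over n independent questions, the binomial standard error is sqrt(p(1-p)/n). The helper below assumes this simple binomial model (the site's exact SE_pred definition is not given here); n = 2294 is the size of the SWE-bench test split, used as an illustrative default.

```python
import math

def accuracy_se(p: float, n: int = 2294) -> float:
    """Predicted standard error of an accuracy p estimated over n
    independent questions, under a simple binomial assumption.
    n defaults to 2294, the SWE-bench test split size (illustrative)."""
    return math.sqrt(p * (1.0 - p) / n)

# A model at 44.2% pass@1: SE is about 0.0104, i.e. ~1.0 percentage point.
se = accuracy_se(0.442)
```

Under this assumption the SE(A) column in the table below (e.g. 1 at 44.2% pass@1, 0.99 at 33.8%) is consistent with a binomial standard error over roughly 2294 questions, expressed in percentage points.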

CDF of question-level accuracy
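Question-level accuracy is the fraction of models that solve a given question; its empirical CDF shows how difficulty is distributed across the dataset. A minimal sketch of the computation, using randomly generated pass/fail data in place of the real per-question results (which are not reproduced on this page):

```python
import random

random.seed(0)
n_models, n_questions = 23, 500  # hypothetical sizes, for illustration only
# results[m][q] is True if model m solved question q (synthetic stand-in data).
results = [[random.random() < 0.2 for _ in range(n_questions)]
           for _ in range(n_models)]

# Per-question accuracy: fraction of models that solved each question.
question_acc = [sum(results[m][q] for m in range(n_models)) / n_models
                for q in range(n_questions)]

# Empirical CDF: sort the accuracies; the CDF at the i-th smallest value
# is (i + 1) / n_questions.
acc_sorted = sorted(question_acc)
cdf = [(i + 1) / n_questions for i in range(n_questions)]
```

Plotting `acc_sorted` against `cdf` reproduces a curve of this kind; a steep rise near zero indicates many questions that almost no model solves.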

Results table by model

| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| 20251027_salesforce_SAGE | 44.2 | 29.9 | 1 | 1 | NaN | NaN |
| 20250605_atlassian-rovo-dev | 42 | 27.2 | 1 | 1 | NaN | NaN |
| 20250522_amazon-q-developer-agent-20250405-dev | 37.1 | 22.5 | 1 | 1 | NaN | NaN |
| 20250227_sweagent-claude-3-7-20250219 | 33.8 | 20 | 1 | 0.99 | NaN | NaN |
| 20250131_amazon-q-developer-agent-20241202-dev | 30 | 16.8 | 1 | 0.96 | NaN | NaN |
| 20241103_OpenHands-CodeAct-2.1-sonnet-20241022 | 29.4 | 16.6 | 1 | 0.95 | NaN | NaN |
| 20241121_autocoderover-v2.0-claude-3-5-sonnet-20241022 | 24.9 | 13.3 | 1 | 0.9 | NaN | NaN |
| 20240820_honeycomb | 22.1 | 11.5 | 1 | 0.87 | NaN | NaN |
| 20240721_amazon-q-developer-agent-20240719-dev | 19.7 | 9.95 | 1 | 0.83 | NaN | NaN |
| 20240617_factory_code_droid | 19.3 | 9.46 | 1 | 0.82 | NaN | NaN |
| 20240628_autocoderover-v20240620 | 18.8 | 9.28 | 1 | 0.82 | NaN | NaN |
| 20240620_sweagent_claude3.5sonnet | 18.1 | 9.05 | 1 | 0.8 | NaN | NaN |
| 20240615_appmap-navie_gpt4o | 14.6 | 6.67 | 1 | 0.74 | NaN | NaN |
| 20240509_amazon-q-developer-agent-20240430-dev | 13.8 | 6.57 | 1 | 0.72 | NaN | NaN |
| 20240402_sweagent_gpt4 | 12.5 | 5.5 | 1 | 0.69 | NaN | NaN |
| 20240728_sweagent_gpt4o | 12 | 5.44 | 1 | 0.68 | NaN | NaN |
| 20240402_sweagent_claude3opus | 9.29 | 3.94 | 1 | 0.61 | NaN | NaN |
| 20240402_rag_claude3opus | 3.79 | 1.52 | 1 | 0.4 | NaN | NaN |
| 20231010_rag_claude2 | 1.96 | 0.83 | 1 | 0.29 | NaN | NaN |
| 20240402_rag_gpt4 | 1.31 | 0.487 | 1 | 0.24 | NaN | NaN |
| 20231010_rag_swellama13b | 0.697 | 0.254 | 1 | 0.17 | NaN | NaN |
| 20231010_rag_swellama7b | 0.697 | 0.349 | 1 | 0.17 | NaN | NaN |
| 20231010_rag_gpt35 | 0.174 | 0.0614 | 1 | 0.087 | NaN | NaN |