swebench-multimodal: by models

SE predicted by accuracy

The typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
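
As a rough illustration of why a standard error depends on accuracy, the sketch below computes the binomial standard error of a single accuracy measurement. It assumes each question is an independent pass/fail trial, which may not be how this page's SE(A) or SE_pred(A) columns are computed. The function name `binomial_se` and the question count `n_questions` are hypothetical; the page does not state the dataset size, so the value used here is a placeholder.

```python
import math


def binomial_se(accuracy_pct: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy, in percentage points.

    Assumes n_questions independent pass/fail questions (an assumption,
    not something stated on the page).
    """
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


if __name__ == "__main__":
    n = 500  # placeholder question count, not taken from the page
    for acc in (10.0, 25.0, 36.5, 50.0):
        print(f"accuracy={acc:5.1f}%  SE={binomial_se(acc, n):.2f} points")
```

Under this assumption the standard error is largest at 50% accuracy and shrinks toward 0% and 100%, so models in the 25 to 37 range would be expected to have similar standard errors.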

CDF of question-level accuracy
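
A minimal sketch of how an empirical CDF of per-question accuracy could be computed. It assumes that "question-level accuracy" means the fraction of models (or samples) that solve each question; the page does not define the term. The function name `empirical_cdf` and the input values are hypothetical.

```python
from typing import List, Sequence, Tuple


def empirical_cdf(values: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (x, F(x)) pairs, where F(x) is the fraction of values <= x."""
    if not values:
        raise ValueError("values must be non-empty")
    xs = sorted(values)
    n = len(xs)
    points: List[Tuple[float, float]] = []
    for i, x in enumerate(xs, start=1):
        if points and points[-1][0] == x:
            # Repeated value: keep only the highest cumulative fraction.
            points[-1] = (x, i / n)
        else:
            points.append((x, i / n))
    return points


if __name__ == "__main__":
    # Hypothetical per-question accuracies, not data from this page.
    per_question_accuracy = [0.0, 0.0, 0.5, 1.0]
    for x, fx in empirical_cdf(per_question_accuracy):
        print(f"accuracy <= {x:.2f}: {fx:.2f} of questions")
```

A large step at 0 in such a curve would indicate many questions that no model solves.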

Results table by model

model                                  pass1  pass@count  win_rate  count  SE(A)  SE_x(A)  SE_pred(A)
20251117_codefuse_pycfuse_svr          36.5   36.5        7.5       1      2.1    NaN      NaN
20250701_GUIRepair_o3                  36.5   36.5        7.59      1      2.1    NaN      NaN
20250611_Refact_Agent_claude-4-sonnet  36.1   36.1        7.17      1      2.1    NaN      NaN
20250528_OpenHands-Versa-claude4       34.9   34.9        6.63      1      2.1    NaN      NaN
20250531_GUIRepair_o4mini              34.3   34.3        6.45      1      2.1    NaN      NaN
20250509_OpenHands-Versa_claude3.7     31.8   31.8        4.81      1      2.1    NaN      NaN
20250531_GUIRepair_gpt41               31.6   31.6        4.26      1      2.1    NaN      NaN
20250401_zencoder                      31     31          3.67      1      2      NaN      NaN
20250531_GUIRepair_gpt4o               30.8   30.8        3.67      1      2      NaN      NaN
20250325_globant_codefixer_agent       30     30          3.53      1      2      NaN      NaN
20250311_zencoder                      27.5   27.5        3.08      1      2      NaN      NaN
20250226_agentless_lite_claude3.5      25.7   25.7        1.64      1      1.9    NaN      NaN