Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.

| model | pass1 | pass@count | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|---|
| 20251117_codefuse_pycfuse_svr | 36.5 | 36.5 | 7.50 | 1 | 2.1 | NaN | NaN |
| 20250701_GUIRepair_o3 | 36.5 | 36.5 | 7.59 | 1 | 2.1 | NaN | NaN |
| 20250611_Refact_Agent_claude-4-sonnet | 36.1 | 36.1 | 7.17 | 1 | 2.1 | NaN | NaN |
| 20250528_OpenHands-Versa-claude4 | 34.9 | 34.9 | 6.63 | 1 | 2.1 | NaN | NaN |
| 20250531_GUIRepair_o4mini | 34.3 | 34.3 | 6.45 | 1 | 2.1 | NaN | NaN |
| 20250509_OpenHands-Versa_claude3.7 | 31.8 | 31.8 | 4.81 | 1 | 2.1 | NaN | NaN |
| 20250531_GUIRepair_gpt41 | 31.6 | 31.6 | 4.26 | 1 | 2.1 | NaN | NaN |
| 20250401_zencoder | 31.0 | 31.0 | 3.67 | 1 | 2.0 | NaN | NaN |
| 20250531_GUIRepair_gpt4o | 30.8 | 30.8 | 3.67 | 1 | 2.0 | NaN | NaN |
| 20250325_globant_codefixer_agent | 30.0 | 30.0 | 3.53 | 1 | 2.0 | NaN | NaN |
| 20250311_zencoder | 27.5 | 27.5 | 3.08 | 1 | 2.0 | NaN | NaN |
| 20250226_agentless_lite_claude3.5 | 25.7 | 25.7 | 1.64 | 1 | 1.9 | NaN | NaN |
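
The `SE(A)` column is not defined in this section. One common reading, offered here only as an assumption, is the binomial standard error of an accuracy measured with a single sample per question (`count` = 1). The sketch below shows how such a value depends on the accuracy and on the number of questions. The function names and the question count `n_questions = 500` are hypothetical, since the dataset size is not stated here. The unpaired difference formula treats the two models as independent, which generally does not hold when both are scored on the same questions.

```python
import math


def binomial_se(accuracy: float, n_questions: int) -> float:
    """Standard error of an accuracy estimated from n independent pass/fail questions.

    Assumption: one sample per question, so each question is a Bernoulli trial.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction in [0, 1]")
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def unpaired_diff_se(acc_a: float, acc_b: float, n_questions: int) -> float:
    """SE of the accuracy difference if the two models were independent.

    Assumption: no per-question correlation between the models.
    """
    return math.hypot(binomial_se(acc_a, n_questions), binomial_se(acc_b, n_questions))


if __name__ == "__main__":
    n_questions = 500  # hypothetical dataset size, not taken from the report
    for acc in (0.257, 0.310, 0.365):
        se_points = 100.0 * binomial_se(acc, n_questions)
        print(f"accuracy={100 * acc:.1f}%  SE={se_points:.2f} points")
```

Under this reading the standard error changes slowly with accuracy in the 25 to 37 percent range and peaks at 50 percent accuracy, so the top and bottom rows of a table like this one would differ by only a few tenths of a point.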