swebench-verified: by models


SE predicted by accuracy

Typical standard errors for comparisons between pairs of models on this dataset, plotted as a function of absolute accuracy.
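The page does not say how the pairwise standard error is computed. One common choice, shown here only as a minimal sketch and not as this page's method, is the paired standard error of the accuracy difference between two models scored on the same questions. The function name `paired_se` and the toy outcome vectors are hypothetical and are not data from this page.

```python
import math

def paired_se(a, b):
    """Standard error of mean(a) - mean(b) for per-question outcomes
    (1 = solved, 0 = not solved) of two models on the same questions."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length outcome lists with at least 2 questions")
    n = len(a)
    # Per-question differences: +1 if only A solved, -1 if only B solved, 0 otherwise.
    d = [x - y for x, y in zip(a, b)]
    mean_d = sum(d) / n
    # Unbiased sample variance of the differences.
    var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)
    return math.sqrt(var_d / n)

# Toy, made-up outcomes for illustration only.
toy_a = [1, 1, 0, 1, 0, 1, 0, 0]
toy_b = [1, 0, 0, 1, 1, 1, 0, 0]
print(paired_se(toy_a, toy_b))
```

Because the two models are scored on the same questions, questions that both solve or both fail contribute nothing to the variance, so the paired standard error is usually smaller than one computed by treating the two accuracies as independent.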

CDF of question-level accuracy

Results table by model

model pass1 win_rate count SE(A) SE_x(A) SE_pred(A)
cwm-selfplay-310 53 11.5 1 2.2 NaN NaN
cwm-selfplay-200 52.8 11.9 1 2.2 NaN NaN
cwm-selfplay-150 51.4 10.7 1 2.2 NaN NaN
cwm-selfplay-90 51 10.7 1 2.2 NaN NaN
cwm-selfplay-160 51 10.2 1 2.2 NaN NaN
cwm-selfplay_160-209 50.8 10.3 5 2.2 1.8 1.3
cwm-selfplay_170-219 50.6 10.3 5 2.2 1.8 1.3
cwm-selfplay-250 50.6 10 1 2.2 NaN NaN
cwm-selfplay-100 50.6 9.6 1 2.2 NaN NaN
cwm-selfplay-120 50.6 10.1 1 2.2 NaN NaN
cwm-selfplay-180 50.6 10.5 1 2.2 NaN NaN
cwm-selfplay-140 50.6 9.88 1 2.2 NaN NaN
cwm-selfplay_120-169 50.6 10.1 5 2.2 1.8 1.3
cwm-selfplay_140-189 50.5 10.1 5 2.2 1.8 1.4
cwm-selfplay_150-199 50.5 10.1 5 2.2 1.8 1.3
cwm-selfplay_180-229 50.5 10.1 5 2.2 1.8 1.3
cwm-selfplay-190 50.4 9.66 1 2.2 NaN NaN
cwm-selfplay-210 50.4 10.2 1 2.2 NaN NaN
cwm-selfplay_110-159 50.3 10 5 2.2 1.8 1.3
cwm-selfplay_130-179 50.2 10 5 2.2 1.8 1.4
cwm-selfplay_90-139 50.2 9.98 5 2.2 1.8 1.3
cwm-selfplay-70 50.2 9.81 1 2.2 NaN NaN
cwm-selfplay-270 50.2 9.72 1 2.2 NaN NaN
cwm-selfplay-260 50.2 9.74 1 2.2 NaN NaN
cwm-selfplay_100-149 50.2 9.82 5 2.2 1.8 1.3
cwm-selfplay_300-349 50 9.95 5 2.2 1.8 1.3
cwm-selfplay-390 50 9.91 1 2.2 NaN NaN
cwm-selfplay-320 50 9.78 1 2.2 NaN NaN
cwm-selfplay-330 50 9.71 1 2.2 NaN NaN
cwm-selfplay_190-239 50 9.69 5 2.2 1.8 1.3
cwm-selfplay_200-249 49.9 9.71 4 2.2 1.8 1.3
cwm-selfplay_80-129 49.8 9.63 5 2.2 1.8 1.3
cwm-selfplay_230-279 49.8 9.47 4 2.2 1.8 1.3
cwm-selfplay-110 49.8 9.78 1 2.2 NaN NaN
cwm-selfplay_70-119 49.8 9.57 5 2.2 1.8 1.3
cwm-selfplay_310-359 49.8 9.74 5 2.2 1.8 1.3
cwm-selfplay_290-339 49.7 9.8 5 2.2 1.8 1.3
cwm-baseline-320 49.6 8.96 1 2.2 NaN NaN
cwm-baseline-280 49.6 10.3 1 2.2 NaN NaN
cwm-selfplay_60-109 49.4 9.49 5 2.2 1.8 1.3
cwm-selfplay-50 49.4 9.01 1 2.2 NaN NaN
cwm-selfplay_210-259 49.4 9.25 4 2.2 1.8 1.3
cwm-selfplay_220-269 49.3 9.13 4 2.2 1.8 1.3
cwm-selfplay_240-289 49.2 9.39 4 2.2 1.8 1.3
cwm-baseline-380 49.2 9.91 1 2.2 NaN NaN
cwm-baseline-190 49.2 9.27 1 2.2 NaN NaN
cwm-selfplay-130 49.2 9.96 1 2.2 NaN NaN
cwm-selfplay_50-99 49.2 9.37 5 2.2 1.8 1.3
cwm-selfplay-170 49 9.59 1 2.2 NaN NaN
cwm-baseline-150 49 9.01 1 2.2 NaN NaN
cwm-baseline-300 49 8.58 1 2.2 NaN NaN
cwm-selfplay_270-319 49 9.49 5 2.2 1.8 1.4
cwm-selfplay_280-329 48.9 9.51 5 2.2 1.8 1.4
cwm-baseline_280-329 48.8 8.95 5 2.2 1.8 1.3
cwm-selfplay_250-299 48.8 9.28 5 2.2 1.8 1.3
cwm-selfplay_320-369 48.8 9.18 5 2.2 1.8 1.3
cwm-selfplay-340 48.8 9.64 1 2.2 NaN NaN
cwm-baseline-290 48.6 8.68 1 2.2 NaN NaN
cwm-baseline_270-319 48.6 9.04 5 2.2 1.8 1.3
cwm-baseline-360 48.6 9.17 1 2.2 NaN NaN
cwm-baseline_260-309 48.5 9.04 5 2.2 1.8 1.3
cwm-baseline_360-409 48.5 9.07 5 2.2 1.8 1.3
cwm-selfplay_330-379 48.5 9.16 5 2.2 1.8 1.3
cwm-baseline_250-299 48.4 9.16 5 2.2 1.8 1.4
cwm-selfplay-370 48.4 9.73 1 2.2 NaN NaN
cwm-selfplay-300 48.4 9.38 1 2.2 NaN NaN
cwm-baseline-250 48.4 9.17 1 2.2 NaN NaN
cwm-baseline-370 48.4 9.3 1 2.2 NaN NaN
cwm-baseline-140 48.4 8.66 1 2.2 NaN NaN
cwm-baseline-270 48.4 9.42 1 2.2 NaN NaN
cwm-selfplay_260-309 48.4 9.15 5 2.2 1.8 1.4
cwm-baseline-400 48.4 9.03 1 2.2 NaN NaN
cwm-selfplay_40-89 48.3 8.77 5 2.2 1.8 1.3
cwm-baseline-220 48.2 9.25 1 2.2 NaN NaN
cwm-baseline-160 48.2 8.68 1 2.2 NaN NaN
cwm-selfplay-360 48.2 8.61 1 2.2 NaN NaN
cwm-selfplay-230 48.2 8.59 1 2.2 NaN NaN
cwm-selfplay-220 48.2 8.33 1 2.2 NaN NaN
cwm-baseline_240-289 48.2 9.08 5 2.2 1.8 1.4
cwm-selfplay_360-409 48.1 8.96 4 2.2 1.8 1.4
cwm-selfplay_30-79 48.1 8.81 5 2.2 1.8 1.3
cwm-baseline_130-179 48 8.46 5 2.2 1.8 1.3
cwm-selfplay-60 48 9.36 1 2.2 NaN NaN
cwm-baseline-130 48 8.46 1 2.2 NaN NaN
cwm-baseline_350-399 48 8.92 5 2.2 1.8 1.3
cwm-baseline-390 48 8.16 1 2.2 NaN NaN
cwm-baseline_120-169 47.9 8.39 5 2.2 1.8 1.3
cwm-selfplay_350-399 47.9 8.83 5 2.2 1.8 1.4
cwm-baseline_290-339 47.9 8.59 5 2.2 1.8 1.3
cwm-baseline_210-259 47.9 8.73 5 2.2 1.8 1.3
cwm-baseline_340-389 47.8 8.97 5 2.2 1.8 1.4
cwm-baseline_230-279 47.8 8.79 5 2.2 1.8 1.4
cwm-baseline_190-239 47.8 8.69 5 2.2 1.8 1.3
cwm-baseline-230 47.8 8.83 1 2.2 NaN NaN
cwm-baseline_220-269 47.8 8.76 5 2.2 1.8 1.4
cwm-baseline_150-199 47.7 8.32 5 2.2 1.8 1.3
cwm-selfplay_340-389 47.7 8.77 5 2.2 1.8 1.3
cwm-baseline-210 47.6 8.34 1 2.2 NaN NaN
cwm-baseline_300-349 47.6 8.54 5 2.2 1.8 1.3
cwm-baseline_140-189 47.6 8.2 5 2.2 1.8 1.3
cwm-baseline_200-249 47.4 8.49 5 2.2 1.8 1.3
cwm-baseline-310 47.4 8.52 1 2.2 NaN NaN
cwm-baseline-100 47.4 7.58 1 2.2 NaN NaN
cwm-baseline-240 47.4 8.28 1 2.2 NaN NaN
cwm-baseline_180-229 47.4 8.36 5 2.2 1.8 1.3
cwm-baseline_110-159 47.3 8.07 5 2.2 1.8 1.3
cwm-selfplay_20-69 47.3 8.48 5 2.2 1.8 1.3
cwm-baseline-340 47.2 8.45 1 2.2 NaN NaN
cwm-baseline_320-369 47.2 8.61 5 2.2 1.8 1.4
cwm-selfplay-80 47.2 8.21 1 2.2 NaN NaN
cwm-selfplay-290 47.2 8.9 1 2.2 NaN NaN
cwm-baseline_160-209 47.2 8.12 5 2.2 1.8 1.3
cwm-baseline_170-219 47 8.05 5 2.2 1.8 1.3
cwm-baseline-260 47 8.51 1 2.2 NaN NaN
cwm-selfplay-350 47 8.36 1 2.2 NaN NaN
cwm-baseline_100-149 47 7.78 5 2.2 1.9 1.2
cwm-baseline_310-359 47 8.48 5 2.2 1.8 1.4
cwm-baseline_330-379 47 8.68 5 2.2 1.7 1.4
cwm-selfplay_10-59 46.6 7.94 5 2.2 1.8 1.3
cwm-selfplay-40 46.6 7.68 1 2.2 NaN NaN
cwm-baseline_90-139 46.6 7.63 5 2.2 1.9 1.2
cwm-baseline-170 46.4 7.7 1 2.2 NaN NaN
cwm-selfplay-30 46.4 8.4 1 2.2 NaN NaN
cwm-baseline-200 46.2 7.99 1 2.2 NaN NaN
cwm-selfplay-20 46.2 8.18 1 2.2 NaN NaN
cwm-baseline-60 46.2 8.15 1 2.2 NaN NaN
cwm-baseline-90 46.2 7.93 1 2.2 NaN NaN
cwm-selfplay-280 46 8.25 1 2.2 NaN NaN
cwm-baseline-70 46 7.86 1 2.2 NaN NaN
cwm-baseline-120 46 7.36 1 2.2 NaN NaN
cwm-selfplay-380 46 7.77 1 2.2 NaN NaN
cwm-baseline_60-109 45.9 7.72 5 2.2 1.8 1.3
cwm-baseline-180 45.8 7.15 1 2.2 NaN NaN
cwm-baseline-350 45.8 8.28 1 2.2 NaN NaN
cwm-baseline_80-129 45.7 7.41 5 2.2 1.8 1.3
cwm-baseline_70-119 45.7 7.51 5 2.2 1.8 1.3
cwm-baseline_50-99 45.4 7.8 5 2.2 1.8 1.3
cwm-baseline-110 45.2 7.06 1 2.2 NaN NaN
cwm-baseline_40-89 45.1 7.62 5 2.2 1.8 1.3
cwm-baseline_30-79 44.8 7.55 5 2.2 1.8 1.4
cwm-baseline-330 44.8 8.45 1 2.2 NaN NaN
cwm-baseline-50 44.8 7.99 1 2.2 NaN NaN
cwm-selfplay-10 44.6 6.66 1 2.2 NaN NaN
cwm-baseline-40 44.6 7.03 1 2.2 NaN NaN
cwm-baseline_20-69 44.2 7.32 5 2.2 1.7 1.4
cwm-baseline-80 43.8 7.34 1 2.2 NaN NaN
cwm-baseline_10-59 43.3 7.05 5 2.2 1.7 1.4
cwm-baseline-20 43 6.74 1 2.2 NaN NaN
cwm-baseline-30 42.6 6.95 1 2.2 NaN NaN
cwm-baseline-10 41.6 6.81 1 2.2 NaN NaN
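
The page does not define the SE columns, so the following is a sanity check under stated assumptions, not a description of how the table was produced. Assuming the benchmark has 500 questions (the size of SWE-bench Verified, and consistent with pass1 values moving in steps of 0.2), the binomial standard error sqrt(p(1-p)/N) comes out at about 2.2 percentage points for every accuracy in the table, which matches the SE(A) column. The rounded SE_x(A) and SE_pred(A) values also appear to combine in quadrature to roughly SE(A), e.g. sqrt(1.8^2 + 1.3^2) ≈ 2.2. Reading the two columns as independent variance components is an assumption here. The names `binomial_se` and `combine_se` are hypothetical helpers.

```python
import math

N_QUESTIONS = 500  # assumption: number of questions in SWE-bench Verified

def binomial_se(pass1_percent, n=N_QUESTIONS):
    """Standard error, in percentage points, of an accuracy estimated
    from n independent pass/fail questions."""
    p = pass1_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

def combine_se(se_x, se_pred):
    """Combine two standard-error components in quadrature,
    assuming they are independent (an assumption, not stated on the page)."""
    return math.sqrt(se_x ** 2 + se_pred ** 2)

# Highest and lowest pass1 values in the table both round to 2.2.
print(round(binomial_se(53.0), 1))
print(round(binomial_se(41.6), 1))

# Rounded components from a typical count-5 row.
print(round(combine_se(1.8, 1.3), 1))
```

Since the table reports the components rounded to one decimal place, the quadrature sum only matches SE(A) approximately; a row with 1.8 and 1.4 gives about 2.28 rather than 2.2.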