swebench-pro: by models



SE predicted by accuracy

Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.
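As a rough illustration of why standard error depends on accuracy, the sketch below computes the plain binomial standard error of an accuracy estimate. It assumes each question is an independent pass/fail trial and that the benchmark has 731 questions. Both are assumptions, not stated on this page; the question count was chosen only because it reproduces the rounded SE(A) values in the results table. The function name `binomial_se` is hypothetical.

```python
import math


def binomial_se(accuracy_pct: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy, in percentage points.

    Treats every question as an independent Bernoulli trial with
    success probability accuracy_pct / 100 (an assumption).
    """
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


# Assumed question count; not given on this page.
N_QUESTIONS = 731

# pass1 values taken from the top and bottom rows of the results table.
for acc in (31.1, 20.9):
    print(f"pass1={acc:.1f}  SE={binomial_se(acc, N_QUESTIONS):.2f}")
```

Under these assumptions the standard error shrinks as accuracy moves away from 50%, which matches the drop from 1.7 to 1.5 in the SE(A) column between the highest and lowest pass1 rows.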

CDF of question-level accuracy

Results table by model

model pass1 win_rate count SE(A) SE_x(A) SE_pred(A)
cwm-baseline-400 31.1 10.7 1 1.7 NaN NaN
cwm-baseline-380 30.2 9.98 1 1.7 NaN NaN
cwm-baseline-210 29.5 9.73 1 1.7 NaN NaN
cwm-selfplay-280 29.5 9.88 1 1.7 NaN NaN
cwm-baseline_360-409 29.5 9.47 5 1.7 1.3 1
cwm-baseline-370 29.4 9.16 1 1.7 NaN NaN
cwm-baseline-340 29.3 9.52 1 1.7 NaN NaN
cwm-selfplay-120 29 9.43 1 1.7 NaN NaN
cwm-baseline_340-389 29 9.11 5 1.7 1.3 1
cwm-baseline-220 28.9 9.24 1 1.7 NaN NaN
cwm-selfplay-150 28.9 9.13 1 1.7 NaN NaN
cwm-baseline_350-399 28.8 9 5 1.7 1.3 1
cwm-selfplay-270 28.7 9.43 1 1.7 NaN NaN
cwm-selfplay-220 28.6 8.61 1 1.7 NaN NaN
cwm-baseline-270 28.6 8.68 1 1.7 NaN NaN
cwm-selfplay_120-169 28.6 8.99 5 1.7 1.3 1
cwm-baseline_330-379 28.5 8.79 5 1.7 1.3 1
cwm-baseline-390 28.5 8.96 1 1.7 NaN NaN
cwm-selfplay-160 28.5 8.68 1 1.7 NaN NaN
cwm-selfplay-140 28.5 9.09 1 1.7 NaN NaN
cwm-selfplay-260 28.5 8.81 1 1.7 NaN NaN
cwm-selfplay_240-289 28.4 9.08 4 1.7 1.3 1.1
cwm-selfplay-80 28.3 8.48 1 1.7 NaN NaN
cwm-selfplay-180 28.3 8.54 1 1.7 NaN NaN
cwm-baseline-320 28.3 8.92 1 1.7 NaN NaN
cwm-baseline-170 28.3 8.64 1 1.7 NaN NaN
cwm-selfplay_140-189 28.3 8.69 5 1.7 1.3 1
cwm-baseline_320-369 28.3 8.74 5 1.7 1.3 1
cwm-baseline_210-259 28.2 8.85 5 1.7 1.3 1.1
cwm-selfplay_130-179 28.2 8.75 5 1.7 1.3 1
cwm-baseline-360 28.2 8.76 1 1.7 NaN NaN
cwm-selfplay_110-159 28.1 8.75 5 1.7 1.3 1.1
cwm-baseline_310-359 28.1 8.62 5 1.7 1.3 1
cwm-selfplay_150-199 28 8.53 5 1.7 1.3 1
cwm-baseline_200-249 28 8.67 5 1.7 1.3 1.1
cwm-selfplay-130 28 8.82 1 1.7 NaN NaN
cwm-baseline-130 28 8.11 1 1.7 NaN NaN
cwm-baseline_190-239 28 8.65 5 1.7 1.3 1
cwm-selfplay_250-299 28 8.81 5 1.7 1.3 1.1
cwm-selfplay-320 27.9 8.97 1 1.7 NaN NaN
cwm-baseline-230 27.9 8.72 1 1.7 NaN NaN
cwm-baseline-290 27.9 8.47 1 1.7 NaN NaN
cwm-baseline-140 27.9 8.74 1 1.7 NaN NaN
cwm-selfplay_260-309 27.9 8.74 5 1.7 1.3 1.1
cwm-selfplay-40 27.8 8.24 1 1.7 NaN NaN
cwm-baseline-350 27.8 8.34 1 1.7 NaN NaN
cwm-baseline-330 27.8 8.37 1 1.7 NaN NaN
cwm-selfplay_270-319 27.7 8.67 5 1.7 1.3 1.1
cwm-baseline_300-349 27.7 8.41 5 1.7 1.3 1
cwm-selfplay_160-209 27.7 8.37 5 1.7 1.3 1
cwm-selfplay_100-149 27.7 8.56 5 1.7 1.3 1.1
cwm-selfplay-60 27.6 8.04 1 1.7 NaN NaN
cwm-baseline-280 27.6 8.72 1 1.7 NaN NaN
cwm-selfplay_230-279 27.6 8.61 4 1.7 1.2 1.1
cwm-baseline-240 27.6 8.49 1 1.7 NaN NaN
cwm-selfplay-370 27.6 8.32 1 1.7 NaN NaN
cwm-selfplay-310 27.6 8.45 1 1.7 NaN NaN
cwm-selfplay_220-269 27.6 8.4 4 1.7 1.3 1
cwm-baseline_180-229 27.6 8.34 5 1.7 1.3 1
cwm-selfplay_280-329 27.6 8.58 5 1.7 1.3 1.1
cwm-baseline-190 27.5 8.38 1 1.7 NaN NaN
cwm-selfplay_80-129 27.5 8.27 5 1.7 1.3 1
cwm-baseline_170-219 27.5 8.22 5 1.7 1.3 1
cwm-baseline_270-319 27.5 8.23 5 1.7 1.3 1
cwm-baseline_290-339 27.4 8.2 5 1.7 1.3 1
cwm-selfplay_180-229 27.4 8.19 5 1.6 1.3 1
cwm-selfplay_90-139 27.4 8.34 5 1.6 1.3 1
cwm-baseline_280-329 27.4 8.27 5 1.6 1.3 1
cwm-baseline_220-269 27.4 8.29 5 1.6 1.3 1
cwm-selfplay-390 27.4 8.8 1 1.6 NaN NaN
cwm-selfplay-170 27.4 8.23 1 1.6 NaN NaN
cwm-baseline_230-279 27.3 8.18 5 1.6 1.3 1
cwm-baseline_250-299 27.3 8.18 5 1.6 1.3 1
cwm-baseline_130-179 27.3 8.15 5 1.6 1.3 1
cwm-baseline_240-289 27.3 8.18 5 1.6 1.3 1
cwm-baseline-310 27.2 8.16 1 1.6 NaN NaN
cwm-selfplay-190 27.2 8.26 1 1.6 NaN NaN
cwm-selfplay-90 27.2 7.97 1 1.6 NaN NaN
cwm-baseline-250 27.2 8.28 1 1.6 NaN NaN
cwm-selfplay_40-89 27.2 7.88 5 1.6 1.3 1
cwm-selfplay_170-219 27.2 8.11 5 1.6 1.3 1
cwm-selfplay_50-99 27.1 7.83 5 1.6 1.3 1
cwm-baseline_260-309 27.1 7.98 5 1.6 1.3 1
cwm-selfplay_60-109 27.1 7.91 5 1.6 1.3 1
cwm-selfplay_190-239 27.1 8.08 5 1.6 1.3 1
cwm-baseline_120-169 27 8.03 5 1.6 1.3 1
cwm-selfplay_200-249 27 8.04 4 1.6 1.3 1
cwm-selfplay_210-259 27 8.05 4 1.6 1.3 1
cwm-selfplay_290-339 26.9 8.19 5 1.6 1.3 1
cwm-baseline_160-209 26.9 7.95 5 1.6 1.3 1
cwm-selfplay-360 26.9 7.95 1 1.6 NaN NaN
cwm-selfplay-200 26.9 8.36 1 1.6 NaN NaN
cwm-baseline-120 26.9 8.02 1 1.6 NaN NaN
cwm-baseline-160 26.9 8.38 1 1.6 NaN NaN
cwm-selfplay_300-349 26.9 8.37 5 1.6 1.2 1.1
cwm-selfplay_360-409 26.9 8.09 4 1.6 1.3 1
cwm-selfplay_310-359 26.9 8.26 5 1.6 1.2 1.1
cwm-baseline_140-189 26.8 7.96 5 1.6 1.3 1
cwm-selfplay-100 26.8 8.2 1 1.6 NaN NaN
cwm-selfplay-50 26.8 7.76 1 1.6 NaN NaN
cwm-selfplay-250 26.8 8.38 1 1.6 NaN NaN
cwm-selfplay_320-369 26.8 8.16 5 1.6 1.2 1.1
cwm-baseline_100-149 26.8 7.83 5 1.6 1.3 1
cwm-selfplay_70-119 26.8 7.8 5 1.6 1.3 1
cwm-baseline_150-199 26.8 7.89 5 1.6 1.3 1
cwm-selfplay_350-399 26.7 7.96 5 1.6 1.3 1
cwm-selfplay_330-379 26.7 8.03 5 1.6 1.3 1.1
cwm-baseline_110-159 26.7 7.77 5 1.6 1.3 1
cwm-selfplay_30-79 26.6 7.75 5 1.6 1.3 1
cwm-selfplay_340-389 26.5 7.93 5 1.6 1.2 1.1
cwm-selfplay-330 26.5 7.96 1 1.6 NaN NaN
cwm-selfplay-230 26.5 7.98 1 1.6 NaN NaN
cwm-baseline_90-139 26.5 7.71 5 1.6 1.3 1
cwm-selfplay-290 26.4 7.78 1 1.6 NaN NaN
cwm-baseline-90 26.4 8.12 1 1.6 NaN NaN
cwm-selfplay-340 26.4 8.66 1 1.6 NaN NaN
cwm-selfplay-300 26.3 8.03 1 1.6 NaN NaN
cwm-baseline-200 26.3 7.4 1 1.6 NaN NaN
cwm-selfplay_20-69 26.2 7.61 5 1.6 1.2 1
cwm-baseline-300 26 7.31 1 1.6 NaN NaN
cwm-selfplay-350 26 7.49 1 1.6 NaN NaN
cwm-selfplay-110 26 7.48 1 1.6 NaN NaN
cwm-selfplay-210 26 7.37 1 1.6 NaN NaN
cwm-selfplay-380 25.7 7.46 1 1.6 NaN NaN
cwm-baseline-180 25.7 7.17 1 1.6 NaN NaN
cwm-baseline-100 25.7 7.39 1 1.6 NaN NaN
cwm-baseline_80-129 25.6 7.4 5 1.6 1.2 1
cwm-selfplay-70 25.4 7.07 1 1.6 NaN NaN
cwm-baseline-260 25.3 6.94 1 1.6 NaN NaN
cwm-baseline-40 25.3 7.46 1 1.6 NaN NaN
cwm-baseline-150 25.3 7.09 1 1.6 NaN NaN
cwm-baseline_70-119 25.3 7.19 5 1.6 1.2 1
cwm-selfplay-30 25.2 7.85 1 1.6 NaN NaN
cwm-baseline-110 25.2 7.09 1 1.6 NaN NaN
cwm-baseline-70 25.2 6.98 1 1.6 NaN NaN
cwm-selfplay_10-59 25 7.19 5 1.6 1.2 1.1
cwm-baseline_60-109 24.9 6.98 5 1.6 1.2 1
cwm-baseline_50-99 24.4 6.7 5 1.6 1.2 1
cwm-baseline_40-89 24.1 6.57 5 1.6 1.2 1
cwm-baseline-80 23.9 6.58 1 1.6 NaN NaN
cwm-baseline_30-79 23.9 6.45 5 1.6 1.2 1
cwm-selfplay-20 23.4 6.39 1 1.6 NaN NaN
cwm-baseline_20-69 23.3 6.22 5 1.6 1.2 1
cwm-baseline-60 23.3 6.03 1 1.6 NaN NaN
cwm-baseline-30 23 5.96 1 1.6 NaN NaN
cwm-baseline-50 23 6 1 1.6 NaN NaN
cwm-baseline_10-59 22.8 6.1 5 1.6 1.2 1
cwm-baseline-20 22 5.83 1 1.5 NaN NaN
cwm-selfplay-10 22 5.95 1 1.5 NaN NaN
cwm-baseline-10 20.9 5.46 1 1.5 NaN NaN