There are 262 examples not solved by any model.
If these are well-posed problems, solving some of them is a good signal that your model is indeed better than the leading models.
GoogleChrome__lighthouse-10176, GoogleChrome__lighthouse-10295, GoogleChrome__lighthouse-10505, GoogleChrome__lighthouse-11068, GoogleChrome__lighthouse-11489, GoogleChrome__lighthouse-11579, GoogleChrome__lighthouse-11738, GoogleChrome__lighthouse-12970, GoogleChrome__lighthouse-13185, GoogleChrome__lighthouse-1446, GoogleChrome__lighthouse-14479, GoogleChrome__lighthouse-14515, GoogleChrome__lighthouse-14672, GoogleChrome__lighthouse-15054, GoogleChrome__lighthouse-15092, GoogleChrome__lighthouse-1549, GoogleChrome__lighthouse-1563, GoogleChrome__lighthouse-1617, GoogleChrome__lighthouse-1755, GoogleChrome__lighthouse-1786, GoogleChrome__lighthouse-1916, GoogleChrome__lighthouse-1941, GoogleChrome__lighthouse-2553, GoogleChrome__lighthouse-3583, GoogleChrome__lighthouse-3606, GoogleChrome__lighthouse-3692, GoogleChrome__lighthouse-4301, GoogleChrome__lighthouse-5084, GoogleChrome__lighthouse-5871, GoogleChrome__lighthouse-5925, GoogleChrome__lighthouse-6694, GoogleChrome__lighthouse-6922, GoogleChrome__lighthouse-6989, GoogleChrome__lighthouse-7210, GoogleChrome__lighthouse-7356, GoogleChrome__lighthouse-9151, GoogleChrome__lighthouse-9291, GoogleChrome__lighthouse-9334, GoogleChrome__lighthouse-9451, GoogleChrome__lighthouse-9727, GoogleChrome__lighthouse-9903, GoogleChrome__lighthouse-9932, PrismJS__prism-1572, PrismJS__prism-1573, PrismJS__prism-1747, PrismJS__prism-1897, PrismJS__prism-2029, PrismJS__prism-2295, PrismJS__prism-2348, PrismJS__prism-2622, PrismJS__prism-2649, PrismJS__prism-2678, PrismJS__prism-2680, PrismJS__prism-2686, PrismJS__prism-2713, PrismJS__prism-2754, PrismJS__prism-2782, PrismJS__prism-2792, PrismJS__prism-2841, PrismJS__prism-3001, PrismJS__prism-3050, PrismJS__prism-3141, PrismJS__prism-3351, PrismJS__prism-3355, PrismJS__prism-3372, alibaba-fusion__next-101, alibaba-fusion__next-114, alibaba-fusion__next-1500, alibaba-fusion__next-1708, alibaba-fusion__next-1720, alibaba-fusion__next-1742, alibaba-fusion__next-1788, 
alibaba-fusion__next-1807, alibaba-fusion__next-2131, alibaba-fusion__next-2164, alibaba-fusion__next-2355, alibaba-fusion__next-2860, alibaba-fusion__next-2919, alibaba-fusion__next-2923, alibaba-fusion__next-3034, alibaba-fusion__next-3198, alibaba-fusion__next-3218, alibaba-fusion__next-3345, alibaba-fusion__next-3445, alibaba-fusion__next-3724, alibaba-fusion__next-3947, alibaba-fusion__next-4021, alibaba-fusion__next-4859, alibaba-fusion__next-665, alibaba-fusion__next-870, alibaba-fusion__next-94, alibaba-fusion__next-966, bpmn-io__bpmn-js-1011, bpmn-io__bpmn-js-1192, bpmn-io__bpmn-js-1203, bpmn-io__bpmn-js-1236, bpmn-io__bpmn-js-1256, carbon-design-system__carbon-10188, carbon-design-system__carbon-10214, carbon-design-system__carbon-10225, carbon-design-system__carbon-10262, carbon-design-system__carbon-10283, carbon-design-system__carbon-10301, carbon-design-system__carbon-10599, carbon-design-system__carbon-11352, carbon-design-system__carbon-11416, carbon-design-system__carbon-11613, carbon-design-system__carbon-11621, carbon-design-system__carbon-11743, carbon-design-system__carbon-11761, carbon-design-system__carbon-12027, carbon-design-system__carbon-12151, carbon-design-system__carbon-12262, carbon-design-system__carbon-12412, carbon-design-system__carbon-12435, carbon-design-system__carbon-12442, carbon-design-system__carbon-13196, carbon-design-system__carbon-13218, carbon-design-system__carbon-13224, carbon-design-system__carbon-13421, carbon-design-system__carbon-13527, carbon-design-system__carbon-13851, carbon-design-system__carbon-14239, carbon-design-system__carbon-14476, carbon-design-system__carbon-15197, carbon-design-system__carbon-16237, carbon-design-system__carbon-16332, carbon-design-system__carbon-2885, carbon-design-system__carbon-3139, carbon-design-system__carbon-3237, carbon-design-system__carbon-3253, carbon-design-system__carbon-3283, carbon-design-system__carbon-3362, carbon-design-system__carbon-3626, 
carbon-design-system__carbon-3859, carbon-design-system__carbon-3918, carbon-design-system__carbon-3928, carbon-design-system__carbon-4028, carbon-design-system__carbon-4055, carbon-design-system__carbon-4260, carbon-design-system__carbon-4273, carbon-design-system__carbon-4286, carbon-design-system__carbon-4307, carbon-design-system__carbon-4430, carbon-design-system__carbon-4741, carbon-design-system__carbon-4754, carbon-design-system__carbon-4801, carbon-design-system__carbon-4820, carbon-design-system__carbon-4834, carbon-design-system__carbon-4862, carbon-design-system__carbon-4891, carbon-design-system__carbon-4952, carbon-design-system__carbon-4991, carbon-design-system__carbon-4999, carbon-design-system__carbon-5035, carbon-design-system__carbon-5173, carbon-design-system__carbon-5330, carbon-design-system__carbon-5485, carbon-design-system__carbon-6197, carbon-design-system__carbon-6410, carbon-design-system__carbon-6566, carbon-design-system__carbon-6726, carbon-design-system__carbon-6949, carbon-design-system__carbon-7012, carbon-design-system__carbon-7212, carbon-design-system__carbon-7288, carbon-design-system__carbon-7353, carbon-design-system__carbon-7478, carbon-design-system__carbon-7687, carbon-design-system__carbon-7768, carbon-design-system__carbon-7908, carbon-design-system__carbon-8022, carbon-design-system__carbon-8130, carbon-design-system__carbon-8222, carbon-design-system__carbon-8279, carbon-design-system__carbon-8296, carbon-design-system__carbon-8452, carbon-design-system__carbon-8469, carbon-design-system__carbon-8477, carbon-design-system__carbon-9074, carbon-design-system__carbon-9189, carbon-design-system__carbon-9700, carbon-design-system__carbon-9812, carbon-design-system__carbon-9994, eslint__eslint-14033, eslint__eslint-8850, grommet__grommet-2124, grommet__grommet-2131, grommet__grommet-2695, grommet__grommet-5243, grommet__grommet-6293, grommet__grommet-6296, grommet__grommet-6350, grommet__grommet-6438, grommet__grommet-6490, 
grommet__grommet-6494, grommet__grommet-6600, grommet__grommet-6749, grommet__grommet-7025, highlightjs__highlight.js-2684, highlightjs__highlight.js-2703, highlightjs__highlight.js-2704, highlightjs__highlight.js-2726, highlightjs__highlight.js-2727, highlightjs__highlight.js-2740, highlightjs__highlight.js-2750, highlightjs__highlight.js-2765, highlightjs__highlight.js-2785, highlightjs__highlight.js-2811, highlightjs__highlight.js-2897, highlightjs__highlight.js-2899, highlightjs__highlight.js-2927, highlightjs__highlight.js-2932, highlightjs__highlight.js-2960, highlightjs__highlight.js-2969, highlightjs__highlight.js-2972, highlightjs__highlight.js-3000, highlightjs__highlight.js-3070, highlightjs__highlight.js-3154, highlightjs__highlight.js-3203, highlightjs__highlight.js-3207, highlightjs__highlight.js-3212, highlightjs__highlight.js-3249, highlightjs__highlight.js-3278, highlightjs__highlight.js-3287, highlightjs__highlight.js-3301, highlightjs__highlight.js-3316, highlightjs__highlight.js-3367, highlightjs__highlight.js-3411, highlightjs__highlight.js-3516, highlightjs__highlight.js-3559, openlayers__openlayers-13860, openlayers__openlayers-13893, prettier__prettier-12177, prettier__prettier-14688, prettier__prettier-14961, prettier__prettier-4202, prettier__prettier-6319, prettier__prettier-9514, prettier__prettier-9866, quarto-dev__quarto-cli-1650, quarto-dev__quarto-cli-2583, quarto-dev__quarto-cli-3853, quarto-dev__quarto-cli-4025, quarto-dev__quarto-cli-4539, quarto-dev__quarto-cli-4695, quarto-dev__quarto-cli-4708, quarto-dev__quarto-cli-4732, quarto-dev__quarto-cli-5010, quarto-dev__quarto-cli-5064, quarto-dev__quarto-cli-5091, quarto-dev__quarto-cli-5292, quarto-dev__quarto-cli-5547, quarto-dev__quarto-cli-6388, quarto-dev__quarto-cli-6659, quarto-dev__quarto-cli-6902, quarto-dev__quarto-cli-896, scratchfoundation__scratch-gui-2778, scratchfoundation__scratch-gui-3342, scratchfoundation__scratch-gui-4568, scratchfoundation__scratch-gui-8492, 
scratchfoundation__scratch-gui-8891
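
As a rough illustration of how a list like this could be derived, the sketch below scans a per-model results table and keeps the examples that no model solves. The `results` layout, the function name `unsolved_examples`, and the toy ids are all hypothetical; the report does not state how its data is stored.

```python
from typing import Dict, List

# Hypothetical layout (an assumption, not the report's actual format):
# results[model][example_id] is that model's pass@1 on that example.
Results = Dict[str, Dict[str, float]]


def unsolved_examples(results: Results) -> List[str]:
    """Return the ids of examples whose pass@1 is 0 for every model."""
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)
    # A missing entry is treated as "not solved" by that model.
    return sorted(
        ex
        for ex in example_ids
        if all(per_example.get(ex, 0.0) == 0.0 for per_example in results.values())
    )


# Toy data with made-up ids, only to show the call.
toy = {
    "model_a": {"ex-1": 1.0, "ex-2": 0.0, "ex-3": 0.0},
    "model_b": {"ex-1": 0.0, "ex-2": 0.0, "ex-3": 1.0},
}
print(unsolved_examples(toy))
```

A new model's results can then be checked against the returned ids to see which previously unsolved examples it passes.
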
| example_link | model | min_pass1_of_model |
|---|---|---|
| GoogleChrome__lighthouse-5791 | 20251117_codefuse_pycfuse_svr | 0.365 |
| carbon-design-system__carbon-12302 | 20250701_GUIRepair_o3 | 0.365 |
| alibaba-fusion__next-4806 | 20251117_codefuse_pycfuse_svr | 0.365 |
| grommet__grommet-6246 | 20250701_GUIRepair_o3 | 0.365 |
| highlightjs__highlight.js-3438 | 20251117_codefuse_pycfuse_svr | 0.365 |
| carbon-design-system__carbon-6960 | 20250701_GUIRepair_o3 | 0.365 |
| scratchfoundation__scratch-gui-5039 | 20251117_codefuse_pycfuse_svr | 0.365 |
| carbon-design-system__carbon-4226 | 20250701_GUIRepair_o3 | 0.365 |
| bpmn-io__bpmn-js-1174 | 20250611_Refact_Agent_claude-4-sonnet | 0.361 |
| prettier__prettier-4115 | 20250611_Refact_Agent_claude-4-sonnet | 0.361 |
| PrismJS__prism-1500 | 20250611_Refact_Agent_claude-4-sonnet | 0.361 |
| carbon-design-system__carbon-8092 | 20250611_Refact_Agent_claude-4-sonnet | 0.361 |
| bpmn-io__bpmn-js-1179 | 20250528_OpenHands-Versa-claude4 | 0.349 |
| PrismJS__prism-2946 | 20250528_OpenHands-Versa-claude4 | 0.349 |
| carbon-design-system__carbon-7350 | 20250528_OpenHands-Versa-claude4 | 0.349 |
| grommet__grommet-2061 | 20250531_GUIRepair_o4mini | 0.343 |
| eslint__eslint-11407 | 20250531_GUIRepair_o4mini | 0.343 |
| carbon-design-system__carbon-8945 | 20250531_GUIRepair_o4mini | 0.343 |
| bpmn-io__bpmn-js-1092 | 20250531_GUIRepair_o4mini | 0.343 |
| carbon-design-system__carbon-12402 | 20250531_GUIRepair_o4mini | 0.343 |
| carbon-design-system__carbon-8912 | 20250509_OpenHands-Versa_claude3.7 | 0.318 |
| GoogleChrome__lighthouse-14800 | 20250509_OpenHands-Versa_claude3.7 | 0.318 |
| PrismJS__prism-3442 | 20250509_OpenHands-Versa_claude3.7 | 0.318 |
| bpmn-io__bpmn-js-1080 | 20250531_GUIRepair_gpt4o | 0.308 |
| carbon-design-system__carbon-4816 | 20250325_globant_codefixer_agent | 0.300 |
| highlightjs__highlight.js-3457 | 20250325_globant_codefixer_agent | 0.300 |
| carbon-design-system__carbon-6442 | 20250226_agentless_lite_claude3.5 | 0.257 |
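
The report does not define the `min_pass1_of_model` column. One plausible reading, assumed here, is that for each example it gives the lowest overall pass@1 among the models that solve that example, together with that model's name. The sketch below computes this under that assumption; `weakest_solver`, the `results` layout, and the toy data are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical layout: results[model][example_id] is pass@1 on that example.
Results = Dict[str, Dict[str, float]]


def overall_pass1(results: Results) -> Dict[str, float]:
    """Mean pass@1 of each model over the examples it was run on."""
    return {
        model: sum(per_example.values()) / len(per_example)
        for model, per_example in results.items()
    }


def weakest_solver(results: Results) -> Dict[str, Tuple[str, float]]:
    """For each solved example, return (model, overall pass@1) of the
    lowest-scoring model that solves it. This is an assumed reading of
    min_pass1_of_model, not a definition taken from the report."""
    overall = overall_pass1(results)
    best: Dict[str, Tuple[str, float]] = {}
    for model, per_example in results.items():
        for ex, score in per_example.items():
            if score > 0 and (ex not in best or overall[model] < best[ex][1]):
                best[ex] = (model, overall[model])
    return best


toy = {
    "model_a": {"ex-1": 1.0, "ex-2": 0.0, "ex-3": 1.0},
    "model_b": {"ex-1": 1.0, "ex-2": 0.0, "ex-3": 0.0},
}
for ex, (model, score) in sorted(weakest_solver(toy).items()):
    print(ex, model, round(score, 3))
```

Under this reading, a high value means that only comparatively strong models solve the example.
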
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
| example_link | pass1_of_ex | tau |
|---|---|---|
| carbon-design-system__carbon-6442 | 0.083 | -0.411 |
| grommet__grommet-6239 | 0.167 | -0.333 |
| highlightjs__highlight.js-3457 | 0.083 | -0.262 |
| carbon-design-system__carbon-4816 | 0.083 | -0.262 |
| alibaba-fusion__next-895 | 0.167 | -0.222 |
| bpmn-io__bpmn-js-1080 | 0.083 | -0.187 |
| bpmn-io__bpmn-js-1151 | 0.250 | -0.167 |
| quarto-dev__quarto-cli-4064 | 0.250 | -0.167 |
| bpmn-io__bpmn-js-1365 | 0.167 | -0.166 |
| openlayers__openlayers-9083 | 0.833 | -0.166 |
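
To make the `tau` column concrete, here is a minimal sketch assuming it is a Kendall rank correlation (tau-b) between each model's overall score and its result on the single problem. The report does not say which variant it uses, so both that choice and the helper name `kendall_tau_b` are assumptions. A negative value means the stronger models tend to fail the problem while weaker ones pass it.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b, which corrects for ties in either variable."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: counted in neither term
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else float("nan")


# Toy data: overall score of four models, and whether each one
# solved one particular problem (1) or not (0).
overall = [0.1, 0.2, 0.3, 0.4]
solved = [1, 1, 0, 0]  # only the two weakest models solve it
print(round(kendall_tau_b(overall, solved), 3))
```

With only a handful of models and binary outcomes, many pairs are tied, so small differences in tau between problems should be read with caution.
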
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.