swebench-verified: by examples


Not solved by any model

There are 34 examples that no model solved. If these are well-posed problems, solving some of them is a good signal that your model is indeed better than the leading models.
astropy__astropy-13398, astropy__astropy-13977, django__django-10554, django__django-10999, django__django-11087, django__django-11400, django__django-12406, django__django-13212, django__django-13513, django__django-14011, django__django-14034, django__django-14155, django__django-15629, django__django-16263, django__django-16502, django__django-16631, django__django-16667, matplotlib__matplotlib-25479, matplotlib__matplotlib-26466, pydata__xarray-6992, pydata__xarray-7229, pylint-dev__pylint-4551, pylint-dev__pylint-4604, pylint-dev__pylint-4661, pytest-dev__pytest-10356, sphinx-doc__sphinx-11510, sphinx-doc__sphinx-7462, sphinx-doc__sphinx-7748, sphinx-doc__sphinx-9461, sympy__sympy-13852, sympy__sympy-16597, sympy__sympy-20428, sympy__sympy-20438, sympy__sympy-21930
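As a minimal sketch of how such a list could be derived, assume each model's results are available as a set of resolved instance IDs. The instance IDs, model names, and the `resolved_by_model` structure below are hypothetical and are not the format this page uses.

```python
def unsolved_problems(all_ids, resolved_by_model):
    """Return the instance IDs that no model resolved, sorted by name."""
    # Union of everything any model resolved (empty if there are no models).
    solved_by_any = set().union(*resolved_by_model.values())
    return sorted(set(all_ids) - solved_by_any)


# Hypothetical toy data: three problems, two models.
all_ids = ["proj__proj-1", "proj__proj-2", "proj__proj-3"]
resolved_by_model = {
    "model_a": {"proj__proj-3"},
    "model_b": set(),
}

print(unsolved_problems(all_ids, resolved_by_model))
```

On the toy data, `proj__proj-1` and `proj__proj-2` are reported because neither model resolved them.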

Problems solved by 1 model only

example_link	model	min_pass1_of_model
django__django-11820	20250804_epam-ai-run-claude-4-sonnet	0.768
django__django-15252	20250901_warp	0.756
sphinx-doc__sphinx-7590	20250915_JoyCode	0.746
sympy__sympy-18199	20250915_JoyCode	0.746
sympy__sympy-21596	20250915_JoyCode	0.746
sympy__sympy-17630	20251015_Prometheus_v1.2.1_gpt5	0.744
sphinx-doc__sphinx-9229	20251015_Prometheus_v1.2.1_gpt5	0.744
matplotlib__matplotlib-21568	20250623_warp	0.710
django__django-14792	20250805_openhands-Qwen3-Coder-480B-A35B-Instruct	0.696
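A table like the one above could be built roughly as follows. This is a sketch under assumptions: it takes a model's pass1 to be the fraction of all problems the model resolved, and the data layout and names (`resolved_by_model`, `model_a`, `proj__proj-1`) are hypothetical.

```python
from collections import defaultdict


def solved_by_one_model(all_ids, resolved_by_model):
    """List (problem, model, pass1 of that model) for problems with one solver."""
    # Assumption: pass1 of a model = resolved problems / total problems.
    pass1 = {m: len(ids) / len(all_ids) for m, ids in resolved_by_model.items()}

    # Collect the models that solved each problem.
    solvers = defaultdict(list)
    for model, ids in resolved_by_model.items():
        for instance_id in ids:
            solvers[instance_id].append(model)

    rows = [
        (instance_id, models[0], pass1[models[0]])
        for instance_id, models in solvers.items()
        if len(models) == 1
    ]
    # Strongest solver first, then by instance ID for a stable order.
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows


# Hypothetical toy data: four problems, two models.
all_ids = ["proj__proj-1", "proj__proj-2", "proj__proj-3", "proj__proj-4"]
resolved_by_model = {
    "model_a": {"proj__proj-1", "proj__proj-2", "proj__proj-3"},
    "model_b": {"proj__proj-1", "proj__proj-4"},
}

for row in solved_by_one_model(all_ids, resolved_by_model):
    print(*row, sep="\t")
```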

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation. A negative tau means that better models tend to do worse on the problem.

example_link	pass1_of_ex	tau
astropy__astropy-7606	0.053	-0.200
sympy__sympy-18763	0.412	-0.198
django__django-11790	0.214	-0.175
sphinx-doc__sphinx-10614	0.015	-0.046
matplotlib__matplotlib-23476	0.015	-0.043
scikit-learn__scikit-learn-26194	0.023	-0.040
django__django-11477	0.015	-0.016
sympy__sympy-17318	0.076	-0.005
django__django-13794	0.053	0.001
astropy__astropy-13033	0.015	0.004
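The page does not say which correlation is used for tau. One plausible reading, shown here only as a sketch, is Kendall's tau-b between each model's overall accuracy and whether that model solved the problem. The function name and the toy numbers are hypothetical.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences (O(n^2) pair scan)."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx != 0 and dy != 0:
                if dx * dy > 0:
                    concordant += 1
                else:
                    discordant += 1
    total_pairs = n * (n - 1) // 2
    denom = sqrt((total_pairs - ties_x) * (total_pairs - ties_y))
    # If every pair is tied in x or in y, tau is undefined; report 0.0.
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical toy data: overall accuracy of four models, and whether each
# model solved one particular problem (1 = solved, 0 = not solved).
model_accuracy = [0.2, 0.4, 0.6, 0.8]
solved_suspect = [1, 1, 0, 0]  # only the weaker models solve it
solved_normal = [0, 0, 1, 1]   # only the stronger models solve it

print(kendall_tau_b(model_accuracy, solved_suspect))  # negative
print(kendall_tau_b(model_accuracy, solved_normal))   # positive
```

Tau-b is chosen here because the solved/unsolved outcome is binary and therefore heavily tied; a different variant would give different values.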

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
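
The counts behind such a histogram could be computed as sketched below, assuming the accuracy on a problem is the fraction of models that solved it. The bin count and all names are hypothetical choices for illustration.

```python
def accuracy_histogram(all_ids, resolved_by_model, bins=10):
    """Count problems per equal-width accuracy bin over [0, 1]."""
    n_models = len(resolved_by_model)
    counts = [0] * bins
    for instance_id in all_ids:
        n_solved = sum(instance_id in ids for ids in resolved_by_model.values())
        accuracy = n_solved / n_models
        # An accuracy of exactly 1.0 belongs to the last bin.
        index = min(int(accuracy * bins), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical toy data: three problems, two models.
all_ids = ["proj__proj-1", "proj__proj-2", "proj__proj-3"]
resolved_by_model = {
    "model_a": {"proj__proj-1", "proj__proj-2"},
    "model_b": {"proj__proj-1"},
}

print(accuracy_histogram(all_ids, resolved_by_model, bins=4))
```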

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.
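
One way to read "minimum win rate to solve" is the pass1 of the weakest model that still solved the problem, so a high value marks a problem that only strong models solve. The sketch below follows that reading; the reading itself, the pass1 definition (fraction of problems resolved), and all names are assumptions.

```python
def problem_difficulty(all_ids, resolved_by_model):
    """Map each solved problem to the lowest pass1 among models that solved it."""
    # Assumption: pass1 of a model = resolved problems / total problems.
    pass1 = {m: len(ids) / len(all_ids) for m, ids in resolved_by_model.items()}
    difficulty = {}
    for model, ids in resolved_by_model.items():
        for instance_id in ids:
            current = difficulty.get(instance_id)
            if current is None or pass1[model] < current:
                difficulty[instance_id] = pass1[model]
    # Problems that no model solved have no entry.
    return difficulty


# Hypothetical toy data: five problems, two models.
all_ids = [f"proj__proj-{i}" for i in range(1, 6)]
resolved_by_model = {
    "model_a": {"proj__proj-1", "proj__proj-2", "proj__proj-3"},
    "model_b": {"proj__proj-1", "proj__proj-4"},
}

for instance_id, value in sorted(problem_difficulty(all_ids, resolved_by_model).items()):
    print(instance_id, value, sep="\t")
```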