swebench-lite: by examples


Not solved by any model

There are 35 examples that no model has solved. Provided these are well-posed problems, solving some of them is a good signal that your model is genuinely better than the leading models.
astropy__astropy-7746, django__django-11019, django__django-11564, django__django-11630, django__django-14667, django__django-14730, django__django-15695, django__django-16816, django__django-16820, matplotlib__matplotlib-22835, pallets__flask-5063, pydata__xarray-4493, pylint-dev__pylint-7228, scikit-learn__scikit-learn-11040, scikit-learn__scikit-learn-25638, sphinx-doc__sphinx-7686, sphinx-doc__sphinx-7738, sphinx-doc__sphinx-8282, sympy__sympy-11400, sympy__sympy-11870, sympy__sympy-12171, sympy__sympy-13146, sympy__sympy-14024, sympy__sympy-14308, sympy__sympy-14317, sympy__sympy-15308, sympy__sympy-16106, sympy__sympy-16281, sympy__sympy-17630, sympy__sympy-19254, sympy__sympy-20322, sympy__sympy-20639, sympy__sympy-21171, sympy__sympy-23191, sympy__sympy-24102

Problems solved by 1 model only

example_link	model	min_pass1_of_model
pytest-dev__pytest-5221	20250625_ExpeRepair-v1_claude-4-sonnet-20250514	0.603
django__django-14997	20250906_KGCompass_claude-4-sonnet-20250514	0.583
sympy__sympy-13895	20250906_KGCompass_claude-4-sonnet-20250514	0.583
sympy__sympy-18199	20250906_KGCompass_claude-4-sonnet-20250514	0.583
django__django-11905	20250906_KGCompass_claude-4-sonnet-20250514	0.583
sympy__sympy-12236	20250526_sweagent_claude-4-sonnet-20250514	0.567
django__django-16229	20250526_sweagent_claude-4-sonnet-20250514	0.567
sympy__sympy-13773	20250911_isea_claude-3.5-sonnet-20241022	0.513
sympy__sympy-11897	20250619_KGCompass_claude-3.5-sonnet-20241022	0.460
matplotlib__matplotlib-25433	20250901_entroPO_R2E_QwenCoder30BA3B	0.450
pydata__xarray-4248	20241207_kodu_sonnet_v1	0.447
django__django-15252	20241207_kodu_sonnet_v1	0.447
django__django-15738	20241207_kodu_sonnet_v1	0.447
sphinx-doc__sphinx-8273	20241207_kodu_sonnet_v1	0.447
sympy__sympy-13043	20241207_kodu_sonnet_v1	0.447
sympy__sympy-13437	20241207_kodu_sonnet_v1	0.447
astropy__astropy-14182	20240702_codestory_aide_mixed	0.430
django__django-13265	20241025_OpenHands-CodeAct-2.1-sonnet-20241022	0.417
django__django-13220	20250515_codartai	0.417
matplotlib__matplotlib-18869	20250515_codartai	0.417
matplotlib__matplotlib-25079	20250515_codartai	0.417
sphinx-doc__sphinx-8474	20250515_codartai	0.417
scikit-learn__scikit-learn-10949	20250515_codartai	0.417
scikit-learn__scikit-learn-10508	20250515_codartai	0.417
matplotlib__matplotlib-22711	20240627_abanteai_mentatbot_gpt4o	0.380
django__django-11742	20240627_abanteai_mentatbot_gpt4o	0.380
sympy__sympy-19007	20240622_Lingma_Agent	0.330
django__django-11910	20250207_aegis_o3mini	0.303
django__django-13768	20240523_aider	0.263
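
The two lists above can be derived from per-model results. The following is a minimal sketch, not the page's actual code: it assumes each model's results are available as a set of resolved instance ids, and that min_pass1_of_model is the overall pass rate of the lowest-scoring model that solves the problem (for a problem solved by one model, that model's own pass rate). The function name, model names, and instance ids are hypothetical.

```python
def summarize_solvers(resolved_by_model, all_instances):
    """Return (unsolved, single_solver) for a benchmark.

    resolved_by_model: dict mapping model name -> iterable of resolved instance ids.
    all_instances: list of every instance id in the benchmark.
    """
    n = len(all_instances)
    known = set(all_instances)

    # Overall pass rate of each model (assumed meaning of "pass1").
    pass1 = {m: len(set(r) & known) / n for m, r in resolved_by_model.items()}

    # Which models solve each instance.
    solvers = {inst: [] for inst in all_instances}
    for model, resolved in resolved_by_model.items():
        for inst in set(resolved):
            if inst in solvers:
                solvers[inst].append(model)

    unsolved = sorted(i for i, ms in solvers.items() if not ms)
    single_solver = {
        i: (ms[0], pass1[ms[0]]) for i, ms in solvers.items() if len(ms) == 1
    }
    return unsolved, single_solver


# Toy data with hypothetical names; not taken from the leaderboard.
instances = ["p1", "p2", "p3", "p4"]
results = {
    "model_a": {"p1", "p2"},
    "model_b": {"p1"},
    "model_c": {"p1", "p3"},
}
unsolved, single_solver = summarize_solvers(results, instances)
```

In this toy data, p4 is unsolved, while p2 and p3 are each solved by exactly one model.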

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (i.e., better models tend to do worse on them).

example_link	pass1_of_ex	tau
django__django-13768	0.012	-0.076
matplotlib__matplotlib-23299	0.036	-0.074
sympy__sympy-24909	0.107	-0.038
django__django-11910	0.012	-0.028
sympy__sympy-19007	0.012	-0.002
sympy__sympy-18835	0.036	0.013
matplotlib__matplotlib-22711	0.012	0.032
django__django-11742	0.012	0.032
pytest-dev__pytest-8365	0.071	0.038
pylint-dev__pylint-6506	0.155	0.047
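
As a rough illustration of how such a score could be computed, the sketch below correlates, for one problem, each model's solved/unsolved outcome with that model's overall pass rate. It assumes that tau denotes a Kendall rank correlation and uses the tau-b variant because binary outcomes produce many ties; the page does not state which variant it uses, so treat both choices as assumptions. The function name and toy numbers are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall tau-b between two equal-length sequences (O(n^2) pairs)."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif (dx > 0) == (dy > 0):
            concordant += 1
        else:
            discordant += 1
    denom = sqrt(
        (concordant + discordant + tied_x_only)
        * (concordant + discordant + tied_y_only)
    )
    # Undefined when one variable is constant; return 0.0 by convention here.
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical overall pass rates of four models, weakest to strongest.
model_pass1 = [0.2, 0.3, 0.4, 0.5]

# A "suspect" problem: only the weakest model solves it.
tau_suspect = kendall_tau_b(model_pass1, [1, 0, 0, 0])

# A well-behaved hard problem: only the strongest model solves it.
tau_healthy = kendall_tau_b(model_pass1, [0, 0, 0, 1])
```

Under these assumptions, a negative value means the problem is solved mainly by weaker models, and a positive value means it is solved mainly by stronger ones.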

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
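
A minimal sketch of how the data behind such a histogram could be computed, assuming that the accuracy on a problem is the fraction of evaluated models that solve it. The function names and toy data are hypothetical and plotting is left out.

```python
def per_problem_accuracy(resolved_by_model, all_instances):
    """Fraction of models that solve each instance."""
    n_models = len(resolved_by_model)
    return {
        inst: sum(inst in resolved for resolved in resolved_by_model.values())
        / n_models
        for inst in all_instances
    }


def histogram(values, n_bins=10):
    """Count values in [0, 1] into n_bins equal-width bins."""
    counts = [0] * n_bins
    for v in values:
        # A value of exactly 1.0 goes into the last bin.
        idx = min(int(v * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# Toy data with hypothetical names; not taken from the leaderboard.
instances = ["p1", "p2", "p3", "p4"]
results = {
    "model_a": {"p1", "p2", "p3"},
    "model_b": {"p1", "p2"},
    "model_c": {"p1"},
    "model_d": {"p1"},
}
accuracy = per_problem_accuracy(results, instances)
counts = histogram(accuracy.values(), n_bins=4)
```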

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.