There are 34 examples that no model solved.
If these are well-posed problems, solving some of them is a good signal that your model is indeed better than the leading models.
astropy__astropy-13398, astropy__astropy-13977, django__django-10554, django__django-10999, django__django-11087, django__django-11400, django__django-12406, django__django-13212, django__django-13513, django__django-14011, django__django-14034, django__django-14155, django__django-15629, django__django-16263, django__django-16502, django__django-16631, django__django-16667, matplotlib__matplotlib-25479, matplotlib__matplotlib-26466, pydata__xarray-6992, pydata__xarray-7229, pylint-dev__pylint-4551, pylint-dev__pylint-4604, pylint-dev__pylint-4661, pytest-dev__pytest-10356, sphinx-doc__sphinx-11510, sphinx-doc__sphinx-7462, sphinx-doc__sphinx-7748, sphinx-doc__sphinx-9461, sympy__sympy-13852, sympy__sympy-16597, sympy__sympy-20428, sympy__sympy-20438, sympy__sympy-21930
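
A list like the one above can be derived from per-model results. The following is a minimal sketch, assuming a hypothetical `results` mapping from model name to per-example pass/fail flags; the model names and example IDs in it are made up for illustration and are not taken from this report.

```python
# Hypothetical input: results[model][example_id] is True if that model
# resolved the example. All names and values below are illustrative.
def unsolved_examples(results):
    """Return the sorted example IDs that no model resolved."""
    all_examples = set()
    solved = set()
    for per_example in results.values():
        all_examples.update(per_example)
        solved.update(ex for ex, ok in per_example.items() if ok)
    return sorted(all_examples - solved)


results = {
    "model_a": {"ex-1": True, "ex-2": False, "ex-3": False},
    "model_b": {"ex-1": False, "ex-2": True, "ex-3": False},
}
print(unsolved_examples(results))  # ['ex-3']
```
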
| example_link | model | min_pass1_of_model |
|---|---|---|
| django__django-11820 | 20250804_epam-ai-run-claude-4-sonnet | 0.768 |
| django__django-15252 | 20250901_warp | 0.756 |
| sphinx-doc__sphinx-7590 | 20250915_JoyCode | 0.746 |
| sympy__sympy-18199 | 20250915_JoyCode | 0.746 |
| sympy__sympy-21596 | 20250915_JoyCode | 0.746 |
| sympy__sympy-17630 | 20251015_Prometheus_v1.2.1_gpt5 | 0.744 |
| sphinx-doc__sphinx-9229 | 20251015_Prometheus_v1.2.1_gpt5 | 0.744 |
| matplotlib__matplotlib-21568 | 20250623_warp | 0.710 |
| django__django-14792 | 20250805_openhands-Qwen3-Coder-480B-A35B-Instruct | 0.696 |
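
One plausible reading of the `min_pass1_of_model` column is the overall pass@1 of the weakest model that solved the example; the report does not define it, so treat this as an assumption. A minimal sketch under that assumption, using a hypothetical `results` mapping with made-up model names:

```python
# Assumption: min_pass1_of_model = lowest overall pass@1 among the models
# that resolved the example. Input data here is illustrative only.
def min_pass1_of_solvers(results):
    """Map each solved example to (weakest solving model, its pass@1)."""
    # Overall pass@1 per model = fraction of examples it resolved.
    # Assumes every model has at least one recorded result.
    pass1 = {m: sum(r.values()) / len(r) for m, r in results.items()}
    best = {}
    for model, per_example in results.items():
        for ex, ok in per_example.items():
            if ok and (ex not in best or pass1[model] < best[ex][1]):
                best[ex] = (model, pass1[model])
    return best


results = {
    "strong": {"ex-1": True, "ex-2": True, "ex-3": True, "ex-4": False},
    "weak": {"ex-1": True, "ex-2": False, "ex-3": False, "ex-4": False},
}
best = min_pass1_of_solvers(results)
print(best["ex-1"])  # ('weak', 0.25)
print(best["ex-2"])  # ('strong', 0.75)
```

Under this reading, a high value means only strong models solve the example; examples nobody solved (here `ex-4`) do not appear at all.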
These are the 10 problems with the lowest correlation (tau) with the overall evaluation (i.e., better models do not tend to do better on these, and where tau is negative they tend to do worse).
| example_link | pass1_of_ex | tau |
|---|---|---|
| astropy__astropy-7606 | 0.053 | -0.200 |
| sympy__sympy-18763 | 0.412 | -0.198 |
| django__django-11790 | 0.214 | -0.175 |
| sphinx-doc__sphinx-10614 | 0.015 | -0.046 |
| matplotlib__matplotlib-23476 | 0.015 | -0.043 |
| scikit-learn__scikit-learn-26194 | 0.023 | -0.040 |
| django__django-11477 | 0.015 | -0.016 |
| sympy__sympy-17318 | 0.076 | -0.005 |
| django__django-13794 | 0.053 | 0.001 |
| astropy__astropy-13033 | 0.015 | 0.004 |
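
The `tau` column is presumably a Kendall rank correlation between each model's overall score and its outcome on the example; the report does not say which variant is used, so the tau-b formula below is an assumption. A minimal pure-Python sketch with made-up scores:

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall tau-b between two equal-length sequences (O(n^2) pairs)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    # Correlation is undefined when one side is constant; return 0.0 here.
    return (concordant - discordant) / denom if denom else 0.0


# Illustrative data: overall pass@1 of four hypothetical models, and
# whether each one solved a given example (1 = solved, 0 = not solved).
overall = [0.2, 0.4, 0.6, 0.8]
solved = [1, 1, 0, 0]  # only the weaker models solve it
tau = kendall_tau_b(overall, solved)
print(round(tau, 3))  # -0.816
```

A negative value, as in this toy case, means the example is solved more often by weaker models than by stronger ones.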
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.