There are 63 examples not solved by any model.
If these are sound problems, solving some of them is a good signal that your model is genuinely better than the current leading models.
astropy__astropy-13033, astropy__astropy-13398, astropy__astropy-13977, astropy__astropy-14598, astropy__astropy-7606, astropy__astropy-8707, astropy__astropy-8872, django__django-10554, django__django-10999, django__django-11087, django__django-11400, django__django-11820, django__django-13212, django__django-13513, django__django-13794, django__django-14011, django__django-14034, django__django-14155, django__django-14170, django__django-14315, django__django-14792, django__django-15098, django__django-15252, django__django-15629, django__django-15695, django__django-16256, django__django-16263, django__django-16502, django__django-16631, django__django-16667, matplotlib__matplotlib-23476, matplotlib__matplotlib-24870, matplotlib__matplotlib-25479, matplotlib__matplotlib-26208, matplotlib__matplotlib-26466, pydata__xarray-6992, pydata__xarray-7229, pylint-dev__pylint-4551, pylint-dev__pylint-4604, pylint-dev__pylint-4661, pylint-dev__pylint-6528, pylint-dev__pylint-7277, pytest-dev__pytest-10356, pytest-dev__pytest-5840, scikit-learn__scikit-learn-25747, scikit-learn__scikit-learn-26194, sphinx-doc__sphinx-10614, sphinx-doc__sphinx-11510, sphinx-doc__sphinx-7462, sphinx-doc__sphinx-7590, sphinx-doc__sphinx-7748, sphinx-doc__sphinx-7985, sphinx-doc__sphinx-9229, sphinx-doc__sphinx-9602, sympy__sympy-13852, sympy__sympy-16597, sympy__sympy-17630, sympy__sympy-18199, sympy__sympy-20428, sympy__sympy-20438, sympy__sympy-21596, sympy__sympy-21930, sympy__sympy-22080
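As a rough illustration of how such a list can be derived, the sketch below assumes per-model results are available as a mapping from model name to `{example_id: solved}`. The `results` structure, the function name `unsolved_examples`, and the toy IDs are all hypothetical and not taken from the report's own tooling.

```python
from typing import Dict, List


def unsolved_examples(results: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return the example IDs that no model solved.

    `results` maps model name -> {example_id: solved?}. This layout is an
    assumption made for illustration only.
    """
    # Collect every example ID seen for any model.
    all_examples = set()
    for per_example in results.values():
        all_examples.update(per_example)

    # Keep the examples for which every model has a failing (or missing) result.
    return sorted(
        ex
        for ex in all_examples
        if not any(per_example.get(ex, False) for per_example in results.values())
    )


# Toy data: "ex-2" is the only example that neither model solves.
results = {
    "model_a": {"ex-1": True, "ex-2": False, "ex-3": False},
    "model_b": {"ex-1": False, "ex-2": False, "ex-3": True},
}
print(unsolved_examples(results))  # ['ex-2']
```

A new model that solves any ID returned by such a function has done something none of the evaluated models did, which is why the list is worth checking first.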
| example_link | model | min_pass1_of_model |
|---|---|---|
| django__django-12406 | Claude 4.5 Opus medium (20251101) | 0.744 |
| sphinx-doc__sphinx-8548 | Claude 4.5 Opus medium (20251101) | 0.744 |
| matplotlib__matplotlib-21568 | Claude 4.5 Opus medium (20251101) | 0.744 |
| django__django-15957 | Claude 4.5 Opus medium (20251101) | 0.744 |
| psf__requests-6028 | Claude 4.5 Opus medium (20251101) | 0.744 |
| sphinx-doc__sphinx-9461 | Claude 4.5 Opus medium (20251101) | 0.744 |
| django__django-11734 | Gemini 3 Pro Preview (2025-11-18) | 0.742 |
| django__django-13344 | Gemini 3 Pro Preview (2025-11-18) | 0.742 |
| django__django-11265 | Gemini 3 Pro Preview (2025-11-18) | 0.742 |
| sympy__sympy-14248 | Gemini 3 Pro Preview (2025-11-18) | 0.742 |
| sympy__sympy-13798 | Claude 4.5 Sonnet (20250929) | 0.706 |
| django__django-13195 | GPT-5.2 (2025-12-11) | 0.690 |
| django__django-11141 | Minimax M2 | 0.610 |
| pylint-dev__pylint-8898 | Minimax M2 | 0.610 |
| django__django-12273 | GLM-4.6 (T=1) | 0.554 |
| psf__requests-1921 | Qwen3-Coder 480B/A35B Instruct | 0.554 |
| matplotlib__matplotlib-23299 | gpt-oss-120b | 0.260 |
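One plausible reading of the `min_pass1_of_model` column, stated here as an assumption rather than the report's definition, is: among the models that solve an example, take the one with the lowest overall pass@1 and report that score. The sketch below implements that reading; `weakest_solver`, `results`, and `model_pass1` are hypothetical names, and the numbers are toy values.

```python
from typing import Dict, Optional, Tuple


def weakest_solver(
    example_id: str,
    results: Dict[str, Dict[str, bool]],
    model_pass1: Dict[str, float],
) -> Optional[Tuple[str, float]]:
    """Return (model, overall pass@1) for the weakest model that solved the example.

    Returns None when no model solved it. Both input layouts are assumptions.
    """
    # Models with a passing result on this example.
    solvers = [m for m, per_example in results.items() if per_example.get(example_id, False)]
    if not solvers:
        return None
    # The solver with the lowest overall pass@1 sets the "minimum" for this example.
    weakest = min(solvers, key=lambda m: model_pass1[m])
    return weakest, model_pass1[weakest]


# Toy data: three models with made-up overall pass@1 scores.
model_pass1 = {"strong": 0.75, "mid": 0.5, "weak": 0.25}
results = {
    "strong": {"ex-1": True, "ex-2": True, "ex-3": False},
    "mid": {"ex-1": True, "ex-2": False, "ex-3": False},
    "weak": {"ex-1": False, "ex-2": False, "ex-3": False},
}
print(weakest_solver("ex-1", results, model_pass1))  # ('mid', 0.5)
print(weakest_solver("ex-2", results, model_pass1))  # ('strong', 0.75)
print(weakest_solver("ex-3", results, model_pass1))  # None
```

Under this reading, a high value means only strong models solve the example, while a low value paired with a low per-example pass rate points to an example that a weak model solved and stronger ones missed.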
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
| example_link | pass1_of_ex | tau |
|---|---|---|
| pytest-dev__pytest-7205 | 0.571 | -0.324 |
| psf__requests-2317 | 0.250 | -0.268 |
| sympy__sympy-18763 | 0.071 | -0.236 |
| psf__requests-1724 | 0.214 | -0.229 |
| matplotlib__matplotlib-23299 | 0.036 | -0.208 |
| psf__requests-1766 | 0.250 | -0.136 |
| sympy__sympy-23950 | 0.393 | -0.109 |
| psf__requests-1921 | 0.036 | -0.060 |
| django__django-12273 | 0.036 | -0.060 |
| django__django-11477 | 0.107 | -0.042 |
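The report does not say how `tau` is computed. A minimal sketch, assuming it is a Kendall rank correlation (tau-b, which handles the many ties that binary pass/fail outcomes produce) between each model's overall score and its outcome on one problem, could look like this. The function name and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b between two equally long sequences (O(n^2) pair scan)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from all counts
        if dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x) * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0


# Toy data: four models ordered by overall pass@1; only the two weakest
# solve this problem, so the correlation comes out negative.
overall_pass1 = [0.2, 0.4, 0.6, 0.8]
solved_problem = [1, 1, 0, 0]
tau = kendall_tau_b(overall_pass1, solved_problem)
print(round(tau, 3))
```

A negative value flags a problem that weaker models solve more often than stronger ones, which can indicate a flaky test or an ambiguous specification rather than real difficulty.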
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.