There are 35 examples that no model solved.
If these are good problems, solving some of them is a strong signal that your model is indeed better than the leading models.
astropy__astropy-7746, django__django-11019, django__django-11564, django__django-11630, django__django-14667, django__django-14730, django__django-15695, django__django-16816, django__django-16820, matplotlib__matplotlib-22835, pallets__flask-5063, pydata__xarray-4493, pylint-dev__pylint-7228, scikit-learn__scikit-learn-11040, scikit-learn__scikit-learn-25638, sphinx-doc__sphinx-7686, sphinx-doc__sphinx-7738, sphinx-doc__sphinx-8282, sympy__sympy-11400, sympy__sympy-11870, sympy__sympy-12171, sympy__sympy-13146, sympy__sympy-14024, sympy__sympy-14308, sympy__sympy-14317, sympy__sympy-15308, sympy__sympy-16106, sympy__sympy-16281, sympy__sympy-17630, sympy__sympy-19254, sympy__sympy-20322, sympy__sympy-20639, sympy__sympy-21171, sympy__sympy-23191, sympy__sympy-24102
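A minimal sketch of how such a list could be derived, assuming a hypothetical `results` mapping from model name to the set of example IDs that model resolved. The function name, model names, and toy outcomes below are illustrative assumptions, not data from this evaluation.

```python
def unsolved_examples(all_examples, results):
    """Return the examples that no model resolved, sorted by ID."""
    # Union of everything any model solved (empty if there are no models).
    solved = set().union(*results.values())
    return sorted(set(all_examples) - solved)


# Toy data: hypothetical models and outcomes.
all_examples = [
    "django__django-11019",
    "sympy__sympy-11400",
    "pallets__flask-5063",
]
results = {
    "model_a": {"pallets__flask-5063"},
    "model_b": set(),
}

unsolved = unsolved_examples(all_examples, results)
print(len(unsolved), "examples not solved by any model:", ", ".join(unsolved))
```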
| example_link | model | min_pass1_of_model |
|---|---|---|
| pytest-dev__pytest-5221 | 20250625_ExpeRepair-v1_claude-4-sonnet-20250514 | 0.603 |
| django__django-14997 | 20250906_KGCompass_claude-4-sonnet-20250514 | 0.583 |
| sympy__sympy-13895 | 20250906_KGCompass_claude-4-sonnet-20250514 | 0.583 |
| sympy__sympy-18199 | 20250906_KGCompass_claude-4-sonnet-20250514 | 0.583 |
| django__django-11905 | 20250906_KGCompass_claude-4-sonnet-20250514 | 0.583 |
| sympy__sympy-12236 | 20250526_sweagent_claude-4-sonnet-20250514 | 0.567 |
| django__django-16229 | 20250526_sweagent_claude-4-sonnet-20250514 | 0.567 |
| sympy__sympy-13773 | 20250911_isea_claude-3.5-sonnet-20241022 | 0.513 |
| sympy__sympy-11897 | 20250619_KGCompass_claude-3.5-sonnet-20241022 | 0.460 |
| matplotlib__matplotlib-25433 | 20250901_entroPO_R2E_QwenCoder30BA3B | 0.450 |
| pydata__xarray-4248 | 20241207_kodu_sonnet_v1 | 0.447 |
| django__django-15252 | 20241207_kodu_sonnet_v1 | 0.447 |
| django__django-15738 | 20241207_kodu_sonnet_v1 | 0.447 |
| sphinx-doc__sphinx-8273 | 20241207_kodu_sonnet_v1 | 0.447 |
| sympy__sympy-13043 | 20241207_kodu_sonnet_v1 | 0.447 |
| sympy__sympy-13437 | 20241207_kodu_sonnet_v1 | 0.447 |
| astropy__astropy-14182 | 20240702_codestory_aide_mixed | 0.430 |
| django__django-13265 | 20241025_OpenHands-CodeAct-2.1-sonnet-20241022 | 0.417 |
| django__django-13220 | 20250515_codartai | 0.417 |
| matplotlib__matplotlib-18869 | 20250515_codartai | 0.417 |
| matplotlib__matplotlib-25079 | 20250515_codartai | 0.417 |
| sphinx-doc__sphinx-8474 | 20250515_codartai | 0.417 |
| scikit-learn__scikit-learn-10949 | 20250515_codartai | 0.417 |
| scikit-learn__scikit-learn-10508 | 20250515_codartai | 0.417 |
| matplotlib__matplotlib-22711 | 20240627_abanteai_mentatbot_gpt4o | 0.380 |
| django__django-11742 | 20240627_abanteai_mentatbot_gpt4o | 0.380 |
| sympy__sympy-19007 | 20240622_Lingma_Agent | 0.330 |
| django__django-11910 | 20250207_aegis_o3mini | 0.303 |
| django__django-13768 | 20240523_aider | 0.263 |
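One plausible reading of the columns above is that, for each example, `model` is the lowest-scoring model that solved it and `min_pass1_of_model` is that model's overall pass@1. Under that assumption, the table could be built roughly as follows; `results`, `pass1`, and `weakest_solver` are hypothetical names, and the data is a toy stand-in.

```python
def weakest_solver(results, pass1):
    """Map each solved example to (model, pass1) of the lowest-scoring solver."""
    best = {}
    for model, solved in results.items():
        for example in solved:
            # Keep the solver with the smallest overall pass@1 seen so far.
            if example not in best or pass1[model] < best[example][1]:
                best[example] = (model, pass1[model])
    return best


# Toy data: two hypothetical models with their overall pass@1.
results = {"strong": {"ex1", "ex2"}, "weak": {"ex2"}}
pass1 = {"strong": 0.6, "weak": 0.3}

best = weakest_solver(results, pass1)
# Sort rows by min pass@1, highest first, matching the order of the table.
rows = sorted(best.items(), key=lambda kv: kv[1][1], reverse=True)
for example, (model, score) in rows:
    print(f"| {example} | {model} | {score:.3f} |")
```

Rows at the top of such a table are the examples that only high-scoring models solve.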
These are the 10 problems whose results correlate least with the overall evaluation. A negative tau means that better models tend to do worse on that problem.
| example_link | pass1_of_ex | tau |
|---|---|---|
| django__django-13768 | 0.012 | -0.076 |
| matplotlib__matplotlib-23299 | 0.036 | -0.074 |
| sympy__sympy-24909 | 0.107 | -0.038 |
| django__django-11910 | 0.012 | -0.028 |
| sympy__sympy-19007 | 0.012 | -0.002 |
| sympy__sympy-18835 | 0.036 | 0.013 |
| matplotlib__matplotlib-22711 | 0.012 | 0.032 |
| django__django-11742 | 0.012 | 0.032 |
| pytest-dev__pytest-8365 | 0.071 | 0.038 |
| pylint-dev__pylint-6506 | 0.155 | 0.047 |
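The report does not say how `tau` is computed. As a sketch only, the code below assumes it is Kendall's tau-b between each model's overall score and whether that model solved the problem; the function and the toy data are illustrative assumptions.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences (O(n^2), handles ties)."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from every count
            elif dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    # Pairs untied in x times pairs untied in y (the tau-b normalisation).
    denom = sqrt(
        (concordant + discordant + only_y_tied)
        * (concordant + discordant + only_x_tied)
    )
    return (concordant - discordant) / denom if denom else 0.0


# Toy data: overall scores of four hypothetical models, and whether each
# solved one particular problem (1 = solved). Only the weaker models solve it.
model_scores = [0.2, 0.4, 0.6, 0.8]
solved = [1, 1, 0, 0]

tau = kendall_tau_b(model_scores, solved)
print(f"tau = {tau:.3f}")
```

With many ties in the binary outcome, tau-a and tau-b can differ noticeably, so the values in the table should not be expected to match this sketch exactly.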
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.
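Both histograms bin a per-problem value in the range 0 to 1. A minimal sketch of that binning, assuming equal-width bins; the bin count and the toy values are assumptions and do not come from the plots.

```python
def histogram(values, bins=10):
    """Count values in [0, 1] into equal-width bins; 1.0 goes in the last bin."""
    counts = [0] * bins
    for v in values:
        index = min(int(v * bins), bins - 1)
        counts[index] += 1
    return counts


# Toy per-problem accuracies (illustrative only).
accuracies = [0.012, 0.036, 0.107, 0.155, 0.95, 1.0]
counts = histogram(accuracies, bins=10)

for i, count in enumerate(counts):
    low, high = i / 10, (i + 1) / 10
    print(f"{low:.1f}-{high:.1f}: {'#' * count}")
```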