There are 19 examples not solved by any model.
Solving some of these can be a good signal that your model is indeed better than the leading models, provided these are good problems.
CRUXEval-input/112, CRUXEval-input/113, CRUXEval-input/128, CRUXEval-input/129, CRUXEval-input/177, CRUXEval-input/179, CRUXEval-input/185, CRUXEval-input/218, CRUXEval-input/220, CRUXEval-input/229, CRUXEval-input/236, CRUXEval-input/259, CRUXEval-input/413, CRUXEval-input/423, CRUXEval-input/444, CRUXEval-input/501, CRUXEval-input/545, CRUXEval-input/581, CRUXEval-input/729
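A list like the one above can be derived from a per-example results table. The following is a minimal sketch, assuming results are stored as a mapping from example ID to each model's pass@1; the `results` dictionary, the model names, and the example IDs are hypothetical placeholders, not the report's actual data or storage format.

```python
# Hypothetical per-example pass@1 scores, keyed by example ID and then by model.
results = {
    "example/1": {"model_a": 0.0, "model_b": 0.0},
    "example/2": {"model_a": 0.0, "model_b": 0.3},
    "example/3": {"model_a": 1.0, "model_b": 0.9},
}

# An example counts as unsolved when even its best-scoring model has pass@1 of 0.
unsolved = sorted(
    example
    for example, by_model in results.items()
    if max(by_model.values()) == 0.0
)

print(f"There are {len(unsolved)} examples not solved by any model.")
print(", ".join(unsolved))
```

Whether "not solved" means a pass@1 of exactly zero or a pass@1 below some small threshold is an assumption here; adjust the comparison to match the definition the evaluation actually uses.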
| example_link | model | min_pass1_of_model |
|---|---|---|
| CRUXEval-input/250 | gpt-4-0613+cot | 0.755 |
| CRUXEval-input/474 | gpt-4-0613 | 0.698 |
| CRUXEval-input/770 | phind | 0.472 |

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
| example_link | pass1_of_ex | tau |
|---|---|---|
| CRUXEval-input/233 | 0.387 | -0.559 |
| CRUXEval-input/242 | 0.753 | -0.546 |
| CRUXEval-input/373 | 0.550 | -0.475 |
| CRUXEval-input/222 | 0.608 | -0.445 |
| CRUXEval-input/531 | 0.495 | -0.354 |
| CRUXEval-input/199 | 0.522 | -0.342 |
| CRUXEval-input/748 | 0.976 | -0.342 |
| CRUXEval-input/124 | 0.768 | -0.337 |
| CRUXEval-input/797 | 0.975 | -0.314 |
| CRUXEval-input/660 | 0.468 | -0.306 |
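The `tau` column above can be read as a rank correlation between how models score on one problem and how they score overall. The sketch below assumes `tau` is Kendall's tau (the tau-b variant, which handles ties); the report does not state which variant it uses, and all model names, scores, and example IDs here are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical overall accuracy per model.
overall = {"model_a": 0.40, "model_b": 0.55, "model_c": 0.70, "model_d": 0.80}

# Hypothetical per-problem pass@1 per model.
per_problem = {
    "ex/1": {"model_a": 0.1, "model_b": 0.3, "model_c": 0.6, "model_d": 0.9},
    "ex/2": {"model_a": 0.9, "model_b": 0.7, "model_c": 0.4, "model_d": 0.2},
    "ex/3": {"model_a": 0.5, "model_b": 0.2, "model_c": 0.6, "model_d": 0.4},
}

models = sorted(overall)
taus = {
    example: kendall_tau_b(
        [overall[m] for m in models],
        [scores[m] for m in models],
    )
    for example, scores in per_problem.items()
}

# Most negative tau first: these are the "suspect" problems.
suspects = sorted(taus, key=taus.get)
for example in suspects:
    print(f"{example}: tau = {taus[example]:+.3f}")
```

A strongly negative tau means stronger models systematically score lower on that problem, which is often worth a manual check for an ambiguous prompt or a faulty reference answer.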

Histogram of problems by the accuracy on each problem.

Histogram of problems by the minimum win rate to solve each problem.