There are 17 examples not solved by any model.
If these are good problems, solving some of them can be a good signal that your model is indeed better than the leading models.
CRUXEval-input/112, CRUXEval-input/113, CRUXEval-input/128, CRUXEval-input/129, CRUXEval-input/177, CRUXEval-input/185, CRUXEval-input/218, CRUXEval-input/220, CRUXEval-input/259, CRUXEval-input/314, CRUXEval-input/322, CRUXEval-input/413, CRUXEval-input/444, CRUXEval-input/469, CRUXEval-input/501, CRUXEval-input/556, CRUXEval-input/729
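As a rough illustration of how such a list could be derived, the sketch below collects the examples on which every model scores zero. The `results` layout, the model names, and the example ids are hypothetical placeholders, not the report's actual data or code.

```python
# Hypothetical per-model results: model name -> {example id -> pass@1 on that example}.
results = {
    "model_a": {"ex/1": 0.0, "ex/2": 0.8, "ex/3": 0.0},
    "model_b": {"ex/1": 0.0, "ex/2": 0.1, "ex/3": 0.3},
}

def unsolved_examples(results):
    """Return the example ids on which every model has a pass rate of zero."""
    all_examples = set()
    for per_example in results.values():
        all_examples.update(per_example)
    # Assumption: a model with no recorded result for an example counts as not solving it.
    return sorted(
        ex for ex in all_examples
        if all(per_example.get(ex, 0.0) == 0.0 for per_example in results.values())
    )

print(unsolved_examples(results))
```

Here only `ex/1` is reported, because each of the other examples is solved at least occasionally by some model.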
| example_link | model | min_pass1_of_model |
|---|---|---|
| CRUXEval-input/295 | gpt-4-0613+cot | 0.737 |
| CRUXEval-input/75 | gpt-4-0613+cot | 0.737 |
| CRUXEval-input/754 | gpt-4-0613+cot | 0.737 |
| CRUXEval-input/53 | gpt-4-0613+cot | 0.737 |
| CRUXEval-input/620 | gpt-4-0613+cot | 0.737 |
| CRUXEval-input/687 | gpt-4-0613 | 0.680 |
| CRUXEval-input/229 | gpt-3.5-turbo-0613 | 0.457 |
| CRUXEval-input/375 | gpt-3.5-turbo-0613 | 0.457 |
| CRUXEval-input/491 | gpt-3.5-turbo-0613+cot | 0.443 |
| CRUXEval-input/581 | codellama-13b+cot | 0.364 |
| CRUXEval-input/179 | starcoderbase-7b | 0.254 |
| CRUXEval-input/474 | phi-1.5 | 0.161 |
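One plausible reading of the table above is that, for each example, it names the weakest model (by overall pass@1) that still solves it. The sketch below computes that under this assumption; the interpretation of `min_pass1_of_model`, the data layout, and all names are hypothetical and not taken from the report's code.

```python
# Hypothetical inputs: per-example pass rates and each model's overall pass@1.
results = {
    "model_strong": {"ex/1": 0.9, "ex/2": 0.4},
    "model_mid": {"ex/1": 0.5, "ex/2": 0.0},
    "model_weak": {"ex/1": 0.0, "ex/2": 0.0},
}
overall_pass1 = {"model_strong": 0.70, "model_mid": 0.45, "model_weak": 0.20}

def weakest_solver(results, overall_pass1):
    """Map each solved example to (model, overall pass@1) of its weakest solver."""
    weakest = {}
    for model, per_example in results.items():
        for ex, rate in per_example.items():
            if rate <= 0:
                continue  # this model never solves the example
            current = weakest.get(ex)
            if current is None or overall_pass1[model] < current[1]:
                weakest[ex] = (model, overall_pass1[model])
    return weakest

for ex, (model, score) in sorted(weakest_solver(results, overall_pass1).items()):
    print(f"| {ex} | {model} | {score:.3f} |")
```

Under this reading, an example whose weakest solver has a high overall pass@1 is one that only strong models solve.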
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
| example_link | pass1_of_ex | tau |
|---|---|---|
| CRUXEval-input/233 | 0.342 | -0.570 |
| CRUXEval-input/373 | 0.606 | -0.520 |
| CRUXEval-input/598 | 0.645 | -0.449 |
| CRUXEval-input/534 | 0.023 | -0.420 |
| CRUXEval-input/533 | 0.487 | -0.407 |
| CRUXEval-input/673 | 0.135 | -0.376 |
| CRUXEval-input/212 | 0.303 | -0.341 |
| CRUXEval-input/395 | 0.268 | -0.331 |
| CRUXEval-input/783 | 0.310 | -0.318 |
| CRUXEval-input/531 | 0.306 | -0.315 |
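To make the `tau` column concrete, the sketch below computes a rank correlation between each model's overall score and its pass rate on a single example. It assumes `tau` is Kendall's tau (the tau-b variant, which handles ties); the report does not confirm this, and the numbers are made up for illustration.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pairwise version)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither term
        elif dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x) * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0

# Hypothetical data: one entry per model, ordered the same way in both lists.
overall_pass1 = [0.2, 0.4, 0.6, 0.8]       # overall benchmark score of each model
pass1_on_example = [0.9, 0.7, 0.3, 0.1]    # the same models' pass rate on one example

print(kendall_tau_b(overall_pass1, pass1_on_example))
```

A negative value, as in this toy case, means that models with higher overall scores tend to do worse on that particular example, which is what the table above flags.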
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.