CRUXEval-input-T0.8: by examples



Not solved by any model

There are 17 examples that no model solves. If these are well-posed problems, solving some of them is a good signal that your model is indeed better than the leading models.
CRUXEval-input/112, CRUXEval-input/113, CRUXEval-input/128, CRUXEval-input/129, CRUXEval-input/177, CRUXEval-input/185, CRUXEval-input/218, CRUXEval-input/220, CRUXEval-input/259, CRUXEval-input/314, CRUXEval-input/322, CRUXEval-input/413, CRUXEval-input/444, CRUXEval-input/469, CRUXEval-input/501, CRUXEval-input/556, CRUXEval-input/729
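
A list like the one above can be derived from per-example scores. The following is a minimal sketch, not the page's actual pipeline: the `pass1` mapping, the model and example names, and the rule that an example counts as solved when its pass@1 exceeds `threshold` are all assumptions made for illustration.

```python
# Hypothetical data: model name -> {example id -> pass@1 on that example}.
pass1 = {
    "model_a": {"ex/1": 0.0, "ex/2": 0.3, "ex/3": 0.0},
    "model_b": {"ex/1": 0.0, "ex/2": 0.0, "ex/3": 0.9},
}


def unsolved_by_all(pass1, threshold=0.0):
    """Return the example ids on which no model scores above `threshold`.

    Assumption: "solved" means pass@1 > threshold; a missing score counts as 0.
    """
    examples = set().union(*(scores.keys() for scores in pass1.values()))
    return sorted(
        ex
        for ex in examples
        if all(scores.get(ex, 0.0) <= threshold for scores in pass1.values())
    )


print(unsolved_by_all(pass1))  # ['ex/1']
```

Raising `threshold` makes the definition of "solved" stricter and therefore lengthens the list.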

Problems solved by 1 model only

example_link model min_pass1_of_model
CRUXEval-input/295 gpt-4-0613+cot 0.737
CRUXEval-input/75 gpt-4-0613+cot 0.737
CRUXEval-input/754 gpt-4-0613+cot 0.737
CRUXEval-input/53 gpt-4-0613+cot 0.737
CRUXEval-input/620 gpt-4-0613+cot 0.737
CRUXEval-input/687 gpt-4-0613 0.680
CRUXEval-input/229 gpt-3.5-turbo-0613 0.457
CRUXEval-input/375 gpt-3.5-turbo-0613 0.457
CRUXEval-input/491 gpt-3.5-turbo-0613+cot 0.443
CRUXEval-input/581 codellama-13b+cot 0.364
CRUXEval-input/179 starcoderbase-7b 0.254
CRUXEval-input/474 phi-1.5 0.161
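
A table of this shape could be built as sketched below. This rests on two assumptions: that an example counts as solved when its pass@1 is above zero, and that `min_pass1_of_model` is the overall pass@1 of the single model that solves the example. All names and numbers in the sketch are hypothetical.

```python
# Hypothetical data: per-example pass@1 for each model, and each model's overall pass@1.
pass1 = {
    "model_a": {"ex/1": 0.0, "ex/2": 0.3, "ex/3": 0.2},
    "model_b": {"ex/1": 0.0, "ex/2": 0.0, "ex/3": 0.9},
}
overall = {"model_a": 0.25, "model_b": 0.45}


def solved_by_one_model(pass1, overall, threshold=0.0):
    """Return (example, model, overall score) rows for examples with exactly one solver."""
    examples = set().union(*(scores.keys() for scores in pass1.values()))
    rows = []
    for ex in sorted(examples):
        solvers = [m for m, scores in pass1.items() if scores.get(ex, 0.0) > threshold]
        if len(solvers) == 1:
            rows.append((ex, solvers[0], overall[solvers[0]]))
    # Strongest sole solver first, matching the descending order of the table above.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


for row in solved_by_one_model(pass1, overall):
    print(*row)
```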

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).

example_link pass1_of_ex tau
CRUXEval-input/233 0.342 -0.570
CRUXEval-input/373 0.606 -0.520
CRUXEval-input/598 0.645 -0.449
CRUXEval-input/534 0.023 -0.420
CRUXEval-input/533 0.487 -0.407
CRUXEval-input/673 0.135 -0.376
CRUXEval-input/212 0.303 -0.341
CRUXEval-input/395 0.268 -0.331
CRUXEval-input/783 0.310 -0.318
CRUXEval-input/531 0.306 -0.315
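
To illustrate how a negative `tau` arises, here is a small sketch that assumes `tau` is Kendall's rank correlation between each model's overall score and its pass rate on one problem. The tau-b variant (which corrects for ties) and the sample numbers are assumptions; the page does not state which variant it uses.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to no term
        if dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    denom = sqrt((pairs + ties_x) * (pairs + ties_y))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical numbers: four models ordered from weakest to strongest overall,
# and their pass rates on one problem that stronger models tend to fail.
overall_scores = [0.2, 0.4, 0.6, 0.8]
problem_pass_rates = [0.9, 0.5, 0.5, 0.1]
print(kendall_tau_b(overall_scores, problem_pass_rates))
```

A value near -1 means the ranking of models on this one problem is almost the reverse of their overall ranking, which is why such problems are flagged for inspection.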

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
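
As a rough sketch of how such a histogram is binned, the snippet below counts per-problem accuracies into equal-width bins over [0, 1]. The choice of 10 bins is an assumption; the sample values are the `pass1_of_ex` figures from the suspect-problems table, used only as input data.

```python
def histogram(values, bins=10):
    """Count values in [0, 1] into `bins` equal-width bins; 1.0 falls in the last bin."""
    counts = [0] * bins
    for v in values:
        idx = min(int(v * bins), bins - 1)
        counts[idx] += 1
    return counts


accuracies = [0.342, 0.606, 0.645, 0.023, 0.487, 0.135, 0.303, 0.268, 0.310, 0.306]
print(histogram(accuracies))  # [1, 1, 1, 4, 1, 0, 2, 0, 0, 0]
```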

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.
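
One way to read "minimum win rate to solve" is: for each problem, the lowest win rate among the models that solve it. The sketch below follows that reading. The win rate itself is not defined in this section, so it is taken as a given per-model number, and every name and value here is hypothetical.

```python
# Hypothetical data: per-example pass@1 for each model, and a given win rate per model.
pass1 = {
    "model_a": {"ex/1": 0.0, "ex/2": 0.3, "ex/3": 0.2},
    "model_b": {"ex/1": 0.0, "ex/2": 0.0, "ex/3": 0.9},
}
win_rate = {"model_a": 0.35, "model_b": 0.70}


def min_win_rate_to_solve(pass1, win_rate, threshold=0.0):
    """Map each solved example to the lowest win rate among the models that solve it.

    Assumption: "solved" means pass@1 > threshold. Unsolved examples are omitted.
    """
    difficulty = {}
    for model, scores in pass1.items():
        for ex, p in scores.items():
            if p > threshold:
                difficulty[ex] = min(difficulty.get(ex, 1.0), win_rate[model])
    return difficulty


print(min_win_rate_to_solve(pass1, win_rate))  # {'ex/2': 0.35, 'ex/3': 0.35}
```

Under this reading, a problem that only strong models solve gets a high value, so the histogram acts as a difficulty profile of the benchmark.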