CRUXEval-output-T0.8: by examples



Not solved by any model

There are 35 examples that no model solved. If these are sound problems, solving some of them is a good signal that your model is genuinely better than the leading models.
CRUXEval-output/112, CRUXEval-output/113, CRUXEval-output/125, CRUXEval-output/129, CRUXEval-output/163, CRUXEval-output/177, CRUXEval-output/211, CRUXEval-output/218, CRUXEval-output/220, CRUXEval-output/229, CRUXEval-output/250, CRUXEval-output/254, CRUXEval-output/272, CRUXEval-output/280, CRUXEval-output/301, CRUXEval-output/307, CRUXEval-output/310, CRUXEval-output/33, CRUXEval-output/347, CRUXEval-output/375, CRUXEval-output/44, CRUXEval-output/444, CRUXEval-output/445, CRUXEval-output/469, CRUXEval-output/484, CRUXEval-output/488, CRUXEval-output/501, CRUXEval-output/556, CRUXEval-output/581, CRUXEval-output/591, CRUXEval-output/599, CRUXEval-output/622, CRUXEval-output/671, CRUXEval-output/698, CRUXEval-output/726

Problems solved by 1 model only

example_link model min_pass1_of_model
CRUXEval-output/126 gpt-4-0613+cot 0.772
CRUXEval-output/128 gpt-4-0613+cot 0.772
CRUXEval-output/149 gpt-4-0613+cot 0.772
CRUXEval-output/179 gpt-4-0613+cot 0.772
CRUXEval-output/35 gpt-4-0613+cot 0.772
CRUXEval-output/340 gpt-4-0613+cot 0.772
CRUXEval-output/268 gpt-4-0613+cot 0.772
CRUXEval-output/259 gpt-4-0613+cot 0.772
CRUXEval-output/568 gpt-4-0613+cot 0.772
CRUXEval-output/5 gpt-4-0613+cot 0.772
CRUXEval-output/491 gpt-4-0613+cot 0.772
CRUXEval-output/458 gpt-4-0613+cot 0.772
CRUXEval-output/393 gpt-4-0613+cot 0.772
CRUXEval-output/438 gpt-4-0613 0.680
CRUXEval-output/236 gpt-3.5-turbo-0613+cot 0.565
CRUXEval-output/155 gpt-3.5-turbo-0613+cot 0.565
CRUXEval-output/613 gpt-3.5-turbo-0613+cot 0.565
CRUXEval-output/543 gpt-3.5-turbo-0613+cot 0.565
CRUXEval-output/391 gpt-3.5-turbo-0613+cot 0.565
CRUXEval-output/317 gpt-3.5-turbo-0613+cot 0.565
CRUXEval-output/23 codellama-python-13b 0.363
CRUXEval-output/499 mixtral-8x7b 0.363
CRUXEval-output/175 mistral-7b 0.301
CRUXEval-output/514 phi-1.5 0.217
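
As a rough illustration of how the two lists above could be derived, here is a minimal sketch. It assumes a hypothetical `results` mapping from model name to per-example pass rates, and it assumes an example counts as "solved" by a model when that pass rate is above zero; neither the data layout nor that threshold is confirmed by this page. With a single solver, `min_pass1_of_model` is taken to be that model's overall pass@1, which matches the repeated values in the table but is still an interpretation.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data: results[model][example] is the fraction of samples
# that produced the correct output. Names and values are illustrative only.
results: Dict[str, Dict[str, float]] = {
    "model_a": {"ex/1": 0.0, "ex/2": 0.4, "ex/3": 0.9},
    "model_b": {"ex/1": 0.0, "ex/2": 0.0, "ex/3": 0.7},
    "model_c": {"ex/1": 0.0, "ex/2": 0.0, "ex/3": 0.0},
}


def overall_pass1(per_example: Dict[str, float]) -> float:
    """Mean pass rate of one model over all examples."""
    return sum(per_example.values()) / len(per_example)


def split_by_solver_count(
    results: Dict[str, Dict[str, float]],
) -> Tuple[List[str], Dict[str, Tuple[str, float]]]:
    """Return (unsolved examples, {example: (only solver, its overall pass@1)})."""
    overall = {model: overall_pass1(per_ex) for model, per_ex in results.items()}
    examples = sorted(next(iter(results.values())))
    unsolved: List[str] = []
    single: Dict[str, Tuple[str, float]] = {}
    for example in examples:
        # Assumption: "solved" means a pass rate strictly above zero.
        solvers = [m for m, per_ex in results.items() if per_ex[example] > 0]
        if not solvers:
            unsolved.append(example)
        elif len(solvers) == 1:
            single[example] = (solvers[0], overall[solvers[0]])
    return unsolved, single


unsolved, single = split_by_solver_count(results)
print(unsolved)
print(single)
```

In the toy data, `ex/1` is unsolved and `ex/2` is solved only by `model_a`, so it would appear in the second table alongside that model's overall score.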

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (i.e., better models tend to do worse on them).

example_link pass1_of_ex tau
CRUXEval-output/329 0.497 -0.369
CRUXEval-output/571 0.006 -0.329
CRUXEval-output/403 0.339 -0.285
CRUXEval-output/209 0.013 -0.284
CRUXEval-output/297 0.287 -0.280
CRUXEval-output/373 0.068 -0.264
CRUXEval-output/514 0.003 -0.237
CRUXEval-output/644 0.248 -0.237
CRUXEval-output/333 0.142 -0.216
CRUXEval-output/132 0.316 -0.213
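
The `tau` column is not defined on this page. Assuming it is a Kendall rank correlation between each model's overall accuracy and that model's pass rate on the given example, a minimal pure-Python sketch of the tau-b variant (which accounts for ties) could look like the following. The function name and the sample numbers are hypothetical.

```python
import math
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b between two equal-length sequences, O(n^2) pair scan."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x = tied_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                tied_x += 1
            if dy == 0:
                tied_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    total_pairs = n * (n - 1) // 2
    denom = math.sqrt((total_pairs - tied_x) * (total_pairs - tied_y))
    if denom == 0:
        return float("nan")  # undefined when one sequence is constant
    return (concordant - discordant) / denom


# Hypothetical values: overall accuracy of four models, and their pass rates
# on one example where stronger models do worse.
model_accuracy = [0.2, 0.4, 0.6, 0.8]
example_pass_rate = [0.9, 0.5, 0.3, 0.1]
print(kendall_tau_b(model_accuracy, example_pass_rate))  # -1.0
```

A strongly negative value, as in this toy case, is what would flag an example as suspect under that reading.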

Histogram of accuracies

Histogram of problems by the accuracy on each problem.

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.