humaneval+: by examples


Not solved by any model

No model solves these 7 examples. If the problems themselves are sound, solving some of them is a good signal that your model is indeed better than the leading models.
HumanEval/32, HumanEval/91, HumanEval/129, HumanEval/130, HumanEval/132, HumanEval/145, HumanEval/163
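
A minimal sketch of how such a list could be derived, assuming per-model results are available as a mapping from model name to per-problem pass/fail. The `results` layout, the model names, and the function name are hypothetical and chosen for illustration; they are not taken from the evaluation code.

```python
def unsolved_by_all(results):
    """Return the problems that no model solves, sorted by problem number.

    `results` maps model name -> {problem id -> True if solved}.
    This layout is an assumption made for the sketch.
    """
    problems = set()
    for outcomes in results.values():
        problems.update(outcomes)
    unsolved = [
        p for p in problems
        if not any(outcomes.get(p, False) for outcomes in results.values())
    ]
    # Sort on the trailing number so HumanEval/32 comes before HumanEval/129.
    return sorted(unsolved, key=lambda p: int(p.rsplit("/", 1)[1]))


# Toy data, invented for illustration only.
toy_results = {
    "model_a": {"HumanEval/32": False, "HumanEval/129": False, "HumanEval/140": True},
    "model_b": {"HumanEval/32": False, "HumanEval/129": False, "HumanEval/140": False},
}
print(unsolved_by_all(toy_results))  # ['HumanEval/32', 'HumanEval/129']
```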

Problems solved by 1 model only

example_link    model                     min_pass1_of_model
HumanEval/140   speechless-codellama-34b  0.726
HumanEval/124   code-millenials-34b       0.720
HumanEval/93    xwincoder-34b             0.701
HumanEval/76    openchat                  0.689
HumanEval/108   claude-3-sonnet-20240229  0.646
HumanEval/137   Qwen--Qwen1.5-72B-Chat    0.598
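
A rough sketch of how a table like this could be built. It assumes that `min_pass1_of_model` is the lowest overall pass@1 among the models that solve the problem, which for a single solver is just that model's pass@1; the page does not define the column, so treat this reading as an assumption. The data layout and all names below are hypothetical.

```python
def solved_by_one(results, pass1):
    """List (problem, model, pass1 of that model) for problems with one solver.

    `results` maps model name -> {problem id -> True if solved};
    `pass1` maps model name -> overall pass@1. Both layouts are assumptions.
    """
    problems = {p for outcomes in results.values() for p in outcomes}
    rows = []
    for p in problems:
        solvers = [m for m, outcomes in results.items() if outcomes.get(p, False)]
        if len(solvers) == 1:
            rows.append((p, solvers[0], pass1[solvers[0]]))
    # Highest pass@1 first, matching the ordering of the table above.
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Toy data, invented for illustration only.
toy_results = {
    "model_a": {"p1": True, "p2": False, "p3": True},
    "model_b": {"p1": True, "p2": True, "p3": False},
}
toy_pass1 = {"model_a": 0.70, "model_b": 0.60}
print(solved_by_one(toy_results, toy_pass1))
# [('p3', 'model_a', 0.7), ('p2', 'model_b', 0.6)]
```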

Suspect problems

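These are the 10 problems whose results correlate least with the overall evaluation, as measured by tau. A negative tau means that better models tend to do worse on the problem; a tau near zero means that solving the problem says little about how good a model is overall.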

example_link    pass1_of_ex  tau
HumanEval/54    0.163        -0.135
HumanEval/154   0.224        -0.046
HumanEval/137   0.020        -0.004
HumanEval/122   0.061         0.010
HumanEval/83    0.041         0.024
HumanEval/47    0.939         0.025
HumanEval/108   0.020         0.042
HumanEval/126   0.041         0.051
HumanEval/11    0.837         0.070
HumanEval/65    0.408         0.078
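
To illustrate the kind of statistic involved, here is a small sketch that assumes `tau` is Kendall's tau (the tau-b variant, which handles ties) between each model's pass/fail on one problem and that model's overall pass@1. The page does not state the exact definition, so this is one plausible reading, not the evaluation's actual code. The function name and toy numbers are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences, in pure Python."""
    concordant = discordant = tied_x = tied_y = 0
    n_pairs = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        n_pairs += 1
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    # If one variable is constant the correlation is undefined; report 0.0 here.
    return (concordant - discordant) / denom if denom else 0.0


# Toy data: overall pass@1 of four models and their outcome on one problem.
overall_pass1 = [0.2, 0.4, 0.6, 0.8]
healthy_problem = [0, 0, 1, 1]   # only the stronger models solve it
suspect_problem = [1, 1, 0, 0]   # only the weaker models solve it

print(kendall_tau_b(overall_pass1, healthy_problem) > 0)  # True
print(kendall_tau_b(overall_pass1, suspect_problem) < 0)  # True
```

A problem like `suspect_problem`, which weaker models pass and stronger models fail, gets a negative tau and would appear near the top of a table sorted this way.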

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
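
As a sketch of the binning behind such a histogram, the snippet below counts problems per accuracy bin. It assumes that a problem's accuracy is the `pass1_of_ex` value, taken here to be the fraction of models that solve it; the function name and the choice of five bins are illustrative. The sample input is just the ten `pass1_of_ex` values from the Suspect problems table, not the full benchmark.

```python
def accuracy_histogram(accuracies, n_bins=10):
    """Count how many accuracies in [0, 1] fall into each of n_bins equal bins."""
    counts = [0] * n_bins
    for acc in accuracies:
        # Clamp so that an accuracy of exactly 1.0 lands in the last bin.
        idx = min(int(acc * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# pass1_of_ex values of the ten suspect problems (a subset of all problems).
sample = [0.163, 0.224, 0.020, 0.061, 0.041, 0.939, 0.020, 0.041, 0.837, 0.408]
print(accuracy_histogram(sample, n_bins=5))  # [6, 1, 1, 0, 2]
```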

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.