mbpp: by examples

Not solved by any model

There are 9 examples not solved by any model. If these are well-posed problems, solving some of them is a good signal that your model is indeed better than the leading models.
Mbpp/235, Mbpp/260, Mbpp/306, Mbpp/311, Mbpp/398, Mbpp/430, Mbpp/462, Mbpp/590, Mbpp/603
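A list like the one above could be derived from per-model pass/fail results roughly as follows. This is a minimal sketch, not the code used for this page: the `results` layout, the model names, and the problem ids are hypothetical placeholders.

```python
# Hypothetical results: model name -> {problem id -> 1 if solved, 0 otherwise}.
results = {
    "model_a": {"p1": 1, "p2": 0, "p3": 0},
    "model_b": {"p1": 1, "p2": 1, "p3": 0},
}


def unsolved_by_all(results):
    """Return the sorted problem ids that no model solved."""
    problems = set()
    for per_problem in results.values():
        problems.update(per_problem)
    return sorted(
        p
        for p in problems
        if not any(per_problem.get(p, 0) for per_problem in results.values())
    )


print(unsolved_by_all(results))  # ['p3']
```

A problem missing from a model's results is treated here as unsolved by that model, which is an assumption of the sketch.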

example_link	model	min_pass1_of_model
Mbpp/780	gpt-4-1106-preview	0.857
Mbpp/310	meta-llama-3-70b-instruct	0.823
Mbpp/448	bigcode--starcoder2-15b-instruct-v0.1	0.780
Mbpp/765	mistral-large-latest	0.728
Mbpp/103	databricks--dbrx-instruct	0.672
Mbpp/468	microsoft--Phi-3-mini-4k-instruct	0.659
Mbpp/124	octocoder	0.593

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
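One way such a ranking might be computed is to correlate, for each problem, the per-model pass/fail outcome with each model's overall accuracy. The sketch below assumes Pearson correlation and a hypothetical `results` layout; the page does not state which correlation measure was actually used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation, or None when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / sqrt(vx * vy)


def lowest_correlation(results, k=10):
    """Return the k problem ids whose outcomes correlate least with overall accuracy."""
    models = sorted(results)
    problems = sorted({p for r in results.values() for p in r})
    # Overall accuracy per model (includes the problem being scored).
    overall = [
        sum(results[m].get(p, 0) for p in problems) / len(problems) for m in models
    ]
    scored = []
    for p in problems:
        passes = [results[m].get(p, 0) for m in models]
        r = pearson(passes, overall)
        if r is not None:  # skip problems solved by all models or by none
            scored.append((r, p))
    return [p for _, p in sorted(scored)[:k]]


# Hypothetical data: only the weakest model solves p4.
results = {
    "strong": {"p1": 1, "p2": 1, "p3": 1, "p4": 0},
    "mid": {"p1": 1, "p2": 1, "p3": 0, "p4": 0},
    "weak": {"p1": 0, "p2": 0, "p3": 0, "p4": 1},
}
print(lowest_correlation(results, k=1))  # ['p4']
```

Problems solved by every model or by none have no defined correlation and are skipped, so they never appear in this ranking.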

Histogram of problems by the accuracy on each problem.

Histogram of problems by the minimum win rate to solve each problem.
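The quantity behind this histogram can be read as: for each problem, the lowest overall score among the models that solve it. The sketch below follows that reading, which is an assumption, and uses a generic `overall` score per model (pass@1 or win rate, whichever is available); all names and numbers are hypothetical.

```python
def min_score_to_solve(results, overall):
    """Map each solved problem to the lowest overall score among its solvers."""
    best = {}
    for model, per_problem in results.items():
        for problem, solved in per_problem.items():
            if solved and (problem not in best or overall[model] < best[problem]):
                best[problem] = overall[model]
    return best


def histogram(values, bins=10):
    """Count values in equal-width bins over [0, 1]; 1.0 falls in the last bin."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return counts


# Hypothetical per-model outcomes and overall scores.
results = {
    "model_a": {"p1": 1, "p2": 1},
    "model_b": {"p1": 1, "p2": 0},
}
overall = {"model_a": 0.85, "model_b": 0.35}

thresholds = min_score_to_solve(results, overall)
print(thresholds)
print(histogram(thresholds.values(), bins=5))
```

Problems that no model solves get no entry, so they are absent from the histogram rather than counted in any bin.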