mbpp: by examples



Not solved by any model

No model solves the following 9 examples. Provided the problems themselves are sound, solving some of them is a good signal that your model is better than the leading models.
Mbpp/235, Mbpp/260, Mbpp/306, Mbpp/311, Mbpp/398, Mbpp/430, Mbpp/462, Mbpp/590, Mbpp/603
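A list like the one above can be derived from per-model pass/fail records. The following is a minimal sketch, assuming a hypothetical `results` mapping from model name to per-problem outcomes; the model names and values are made up for illustration and are not taken from the evaluation.

```python
# Hypothetical per-model results: model name -> {problem id -> solved?}.
# These values are illustrative only, not real leaderboard data.
results = {
    "model_a": {"Mbpp/1": True, "Mbpp/2": False, "Mbpp/3": False},
    "model_b": {"Mbpp/1": True, "Mbpp/2": True, "Mbpp/3": False},
}


def unsolved_by_all(results):
    """Return the sorted problem ids that no model solved."""
    problems = set()
    for per_problem in results.values():
        problems.update(per_problem)
    return sorted(
        p
        for p in problems
        # A problem missing from a model's record is treated as unsolved.
        if not any(per_problem.get(p, False) for per_problem in results.values())
    )


print(unsolved_by_all(results))  # ['Mbpp/3']
```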

Problems solved by 1 model only

example_link model min_pass1_of_model
Mbpp/780 gpt-4-1106-preview 0.857
Mbpp/310 meta-llama-3-70b-instruct 0.823
Mbpp/448 bigcode--starcoder2-15b-instruct-v0.1 0.780
Mbpp/765 mistral-large-latest 0.728
Mbpp/103 databricks--dbrx-instruct 0.672
Mbpp/468 microsoft--Phi-3-mini-4k-instruct 0.659
Mbpp/124 octocoder 0.593
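Finding problems with exactly one solver follows the same pattern. The sketch below again assumes a hypothetical `results` mapping with made-up values; it reports only the problem and its single solver and does not attempt to reproduce the `min_pass1_of_model` column.

```python
# Hypothetical per-model results: model name -> {problem id -> solved?}.
# These values are illustrative only, not real leaderboard data.
results = {
    "model_a": {"Mbpp/1": True, "Mbpp/2": False, "Mbpp/3": False},
    "model_b": {"Mbpp/1": True, "Mbpp/2": True, "Mbpp/3": False},
}


def solved_by_one(results):
    """Map each problem solved by exactly one model to that model's name."""
    solvers = {}
    for model, per_problem in results.items():
        for problem, passed in per_problem.items():
            if passed:
                solvers.setdefault(problem, []).append(model)
    return {p: models[0] for p, models in solvers.items() if len(models) == 1}


print(solved_by_one(results))  # {'Mbpp/2': 'model_b'}
```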

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e. better models tend to do worse on them.

example_link pass1_of_ex tau
Mbpp/87 0.983 -0.159
Mbpp/581 0.119 -0.156
Mbpp/615 0.085 -0.155
Mbpp/142 0.983 -0.149
Mbpp/77 0.339 -0.132
Mbpp/567 0.915 -0.130
Mbpp/404 0.983 -0.099
Mbpp/138 0.034 -0.063
Mbpp/126 0.085 -0.062
Mbpp/20 0.186 -0.060
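To illustrate how a negative tau can arise, here is a minimal sketch that computes Kendall's tau-a between each model's overall score and whether it solved one problem. The page does not state which tau variant it uses or how ties are handled, so tau-a is an assumption here, and the data are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Pairs tied in either variable count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    pairs = list(combinations(range(len(xs)), 2))
    score = 0
    for i, j in pairs:
        prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if prod > 0:
            score += 1
        elif prod < 0:
            score -= 1
    return score / len(pairs)


# Hypothetical data: four models ordered by overall pass@1, and whether
# each one solved a single problem. The weaker models solve it, the
# stronger ones do not, so the correlation is negative.
overall = [0.2, 0.4, 0.6, 0.8]
solved = [1, 1, 0, 0]

tau = kendall_tau_a(overall, solved)
print(round(tau, 3))  # -0.667
```

A problem with a strongly negative value under this kind of measure is worth inspecting by hand, since the cause may be a flawed test or an ambiguous prompt rather than model quality.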

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
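The binning behind such a histogram can be sketched as follows. As sample input it uses only the ten `pass1_of_ex` values from the suspect-problems table above, so the resulting counts do not represent the full benchmark; the bin count of 5 is an arbitrary choice for illustration.

```python
def accuracy_histogram(accuracies, bins=5):
    """Count how many accuracies in [0, 1] fall into each equal-width bin."""
    counts = [0] * bins
    for acc in accuracies:
        # An accuracy of exactly 1.0 goes into the last bin.
        idx = min(int(acc * bins), bins - 1)
        counts[idx] += 1
    return counts


# pass1_of_ex values of the ten suspect problems listed in the table above.
sample = [0.983, 0.119, 0.085, 0.983, 0.339, 0.915, 0.983, 0.034, 0.085, 0.186]

counts = accuracy_histogram(sample)
for i, count in enumerate(counts):
    low, high = i / 5, (i + 1) / 5
    print(f"{low:.1f}-{high:.1f}: {'#' * count}")
```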

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.