mbpp: by examples



Not solved by any model

There are 15 examples not solved by any model. If these are well-posed problems, solving some of them can be a good signal that your model is indeed better than the leading models.
15, 20, 99, 125, 132, 169, 264, 299, 356, 363, 425, 427, 450, 451, 482
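
A minimal sketch of how a list like this could be derived, assuming a hypothetical mapping `results` from model name to per-problem pass@1 scores. The function name, the data layout, and the toy numbers are illustrative assumptions, not the page's actual pipeline or data.

```python
def unsolved_by_all(results):
    """Return sorted problem ids whose pass@1 is 0 for every model.

    `results` is a hypothetical layout: {model_name: {problem_id: pass@1}}.
    A problem missing from a model's scores is treated as unsolved by it.
    """
    problem_ids = set()
    for scores in results.values():
        problem_ids.update(scores)
    return sorted(
        pid
        for pid in problem_ids
        if all(scores.get(pid, 0.0) == 0.0 for scores in results.values())
    )


# Toy data for illustration only (not taken from the evaluation).
results = {
    "model_a": {1: 1.0, 2: 0.0, 3: 0.0},
    "model_b": {1: 0.0, 2: 0.5, 3: 0.0},
}
print(unsolved_by_all(results))
```

In the toy data only problem 3 is missed by both models, so it is the only id returned.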

Problems solved by 1 model only

example_link model min_pass1_of_model
433 qwen3-14b 0.722
113 deepseek_r1_distill_llama_70b 0.663
187 qwen2.5-coder-7b-instruct 0.648
295 qwen3-4b 0.641
472 qwen2.5-coder-3b-instruct 0.550
302 qwen2.5-coder-0.5b-instruct 0.319
317 mistralai_mistral_7b_instruct_v0.1 0.305
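
One way the first two columns of such a table might be produced, again assuming the hypothetical `{model_name: {problem_id: pass@1}}` layout. The `min_pass1_of_model` column is not reproduced here because the page does not state its definition.

```python
def solved_by_one(results):
    """Map each problem id solved by exactly one model to that model's name.

    "Solved" is assumed here to mean a pass@1 strictly greater than 0.
    """
    solvers = {}
    for model, scores in results.items():
        for pid, score in scores.items():
            if score > 0:
                solvers.setdefault(pid, []).append(model)
    return {
        pid: models[0]
        for pid, models in sorted(solvers.items())
        if len(models) == 1
    }


# Toy data for illustration only (not taken from the evaluation).
results = {
    "model_a": {1: 1.0, 2: 0.0, 3: 0.2},
    "model_b": {1: 0.5, 2: 0.0, 3: 0.0},
}
print(solved_by_one(results))
```

Here problem 1 is solved by both models and problem 2 by neither, so only problem 3 is reported, with `model_a` as its sole solver.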

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e., better models tend to do worse on them.

example_link pass1_of_ex tau
76 0.014 -0.235
127 0.007 -0.207
317 0.003 -0.155
302 0.002 -0.141
90 0.040 -0.108
123 0.033 -0.100
301 0.052 -0.075
432 0.108 -0.071
364 0.236 -0.062
200 0.287 -0.036
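
To make the `tau` column concrete, here is a sketch under the assumption that it is a Kendall rank correlation between each model's overall score and its score on the single problem. The page does not say which variant is used, so the tau-b formula below is a guess, and the function name and toy numbers are hypothetical.

```python
import math


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pairs)."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both; contributes to neither factor
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Toy data: overall pass@1 per model, and each model's score on one problem.
overall = [0.2, 0.4, 0.6, 0.8]
on_problem = [1.0, 0.0, 0.0, 0.0]  # only the weakest model solves it
tau = kendall_tau_b(overall, on_problem)
print(round(tau, 3))
```

With this toy data the value is negative, which is the pattern the table flags: the problem is solved by weaker models and missed by stronger ones.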

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
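
A small sketch of the binning behind such a histogram, assuming per-problem accuracy is a value in [0, 1] and that equal-width bins are used. The bin count and the sample values are illustrative assumptions.

```python
def accuracy_histogram(accuracies, num_bins=10):
    """Count problems per equal-width accuracy bin over [0, 1].

    An accuracy of exactly 1.0 is placed in the last bin.
    """
    counts = [0] * num_bins
    for acc in accuracies:
        idx = min(int(acc * num_bins), num_bins - 1)
        counts[idx] += 1
    return counts


# Toy per-problem accuracies for illustration only.
sample = [0.0, 0.05, 0.5, 0.95, 1.0]
counts = accuracy_histogram(sample)
for i, count in enumerate(counts):
    low, high = i / 10, (i + 1) / 10
    print(f"{low:.1f}-{high:.1f} {'#' * count}")
```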

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.