mbpp: by examples

Not solved by any model

No model solves the following 19 examples. If these problems are well-posed, solving some of them is a good signal that your model is genuinely better than the leading models.
125, 132, 15, 169, 187, 198, 20, 232, 264, 317, 356, 425, 450, 451, 472, 482, 49, 72, 99
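
A list like the one above could be derived from per-example results roughly as follows. This is a minimal sketch, not the code behind this page: the `results` layout (model name to example id to pass@1) and the function name `unsolved_examples` are assumptions made for illustration, as is treating "not solved" as a pass@1 of exactly 0.

```python
# Hypothetical layout: results[model][example_id] = pass@1 on that example.
results = {
    "model_a": {1: 1.0, 2: 0.0, 3: 0.0},
    "model_b": {1: 0.5, 2: 0.0, 3: 0.2},
}


def unsolved_examples(results):
    """Return the sorted example ids on which every model scores 0."""
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)
    return sorted(
        ex
        for ex in example_ids
        # Assumption: an example missing from a model's results counts as unsolved.
        if all(per_example.get(ex, 0.0) == 0.0 for per_example in results.values())
    )


print(unsolved_examples(results))  # [2]
```

Example 3 is excluded because one model has a non-zero pass@1 on it, even though the other fails it entirely.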

Problems solved by 1 model only

example_link	model	min_pass1_of_model
433	qwen3-14b	0.729
295	qwen3-14b	0.729
147	qwen2.5-coder-14b-instruct	0.681
37	qwen2.5-coder-14b-instruct	0.681
372	qwen2.5-coder-14b-instruct	0.681
293	qwen2.5-coder-14b-instruct	0.681
208	qwen3-8b	0.653
337	qwen3-8b	0.653
292	qwen3-8b	0.653
328	qwen3-4b	0.640
128	qwen2.5-coder-7b-instruct	0.573
363	qwen2.5-coder-7b-instruct	0.573
4	qwen2.5-coder-7b-instruct	0.573
207	qwen2.5-coder-7b-instruct	0.573
222	qwen2.5-coder-7b-instruct	0.573
224	deepseek_r1_distill_qwen_14b	0.533
302	qwen2.5-coder-0.5b-instruct	0.250
427	qwen2.5-coder-0.5b-instruct	0.250
113	qwen3-0.6b	0.241
127	mistralai_mistral_7b_instruct_v0.1	0.229
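
The table above can be reproduced in spirit with a short filter over per-example results. The sketch below assumes a hypothetical `results` mapping (model to example id to pass@1), assumes "solved" means a pass@1 above 0, and assumes the third column is the overall pass@1 of the single solving model; none of these are confirmed by the page.

```python
def solved_by_one_model(results):
    """List (example_id, model, overall_pass1) for examples only one model solves."""
    # Overall pass@1 of each model: the mean of its per-example scores.
    overall = {
        model: sum(scores.values()) / len(scores)
        for model, scores in results.items()
    }
    example_ids = sorted({ex for scores in results.values() for ex in scores})
    rows = []
    for ex in example_ids:
        solvers = [m for m, scores in results.items() if scores.get(ex, 0.0) > 0.0]
        if len(solvers) == 1:
            rows.append((ex, solvers[0], round(overall[solvers[0]], 3)))
    # Strongest model first, matching the ordering of the table above.
    rows.sort(key=lambda row: -row[2])
    return rows


# Toy data with hypothetical model names.
results = {
    "strong": {1: 1.0, 2: 1.0, 3: 0.0, 4: 1.0},
    "weak": {1: 1.0, 2: 0.0, 3: 1.0, 4: 0.0},
}
print(solved_by_one_model(results))
# [(2, 'strong', 0.75), (4, 'strong', 0.75), (3, 'weak', 0.5)]
```

Rows near the bottom of such a table, where a weak model is the only solver, are the ones most worth inspecting by hand.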

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e. better models tend to do worse on them.

example_link	pass1_of_ex	tau
90	0.022	-0.202
127	0.007	-0.169
113	0.006	-0.155
427	0.006	-0.141
302	0.006	-0.141
31	0.064	-0.035
364	0.211	-0.032
129	0.076	-0.023
300	0.017	-0.015
404	0.027	-0.011
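
The page does not say how `tau` is computed. One plausible reading is Kendall's tau between each model's overall pass@1 and its score on the single example. The sketch below implements the simple tau-a variant under that assumption; the function name and the per-example scores are hypothetical, and the page may use a different variant (for example tau-b, which corrects for ties).

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    score = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            score += 1  # concordant pair
        elif product < 0:
            score -= 1  # discordant pair; tied pairs contribute 0
    n = len(xs)
    return score / (n * (n - 1) / 2)


# Overall pass@1 of four models, and hypothetical scores on one example
# that only the weakest of them solves.
overall_pass1 = [0.729, 0.681, 0.573, 0.250]
example_pass1 = [0.0, 0.0, 0.0, 1.0]
print(kendall_tau_a(overall_pass1, example_pass1))  # -0.5
```

A negative value means that models ranked higher overall tend to score lower on this example, which is the pattern the table flags as suspect.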

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
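
The counts behind such a histogram can be computed without a plotting library. This is a minimal sketch assuming equal-width bins over [0, 1]; the function name `accuracy_histogram` and the bin count are illustrative choices, not the settings used for the original figure.

```python
def accuracy_histogram(accuracies, num_bins=10):
    """Count values in equal-width bins over [0, 1]; 1.0 falls in the last bin."""
    counts = [0] * num_bins
    for acc in accuracies:
        if not 0.0 <= acc <= 1.0:
            raise ValueError(f"accuracy out of range: {acc}")
        # Clamp so that an accuracy of exactly 1.0 lands in the final bin.
        index = min(int(acc * num_bins), num_bins - 1)
        counts[index] += 1
    return counts


# A few pass1_of_ex values taken from the suspect-problems table.
pass1_of_ex = [0.022, 0.007, 0.006, 0.064, 0.211, 0.076]
print(accuracy_histogram(pass1_of_ex, num_bins=5))  # [5, 1, 0, 0, 0]
```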

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.