mbpp: by models

SE predicted by accuracy

Typical standard errors for accuracy comparisons between pairs of models on this dataset, plotted as a function of absolute accuracy.
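The reported standard errors track the usual binomial approximation closely. A minimal sketch, assuming accuracy is estimated over the n = 500 MBPP test problems (the exact estimator behind the plot and table is defined in the accompanying paper, so treat this as an approximation):

```python
import math

def binomial_se(p: float, n: int = 500) -> float:
    """Standard error of an accuracy estimate p (as a fraction) over n
    independent questions: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

# The SE(A) column in the table below is consistent with this formula
# (values in percentage points, n = 500 assumed):
print(round(100 * binomial_se(0.729), 2))   # ~1.99, reported as 2 for qwen3-14b
print(round(100 * binomial_se(0.0112), 2))  # ~0.47, as reported for qwen1.5-0.5b-chat
```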

CDF of question-level accuracy
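This plot presumably summarizes, for each MBPP problem, the fraction of attempts that solve it. A minimal sketch of how such a curve can be computed, assuming a binary pass/fail matrix of shape (questions, attempts) as input (the function and data layout below are illustrative assumptions, not this site's code):

```python
import numpy as np

def question_level_accuracy_cdf(results: np.ndarray):
    """results: binary array of shape (n_questions, n_attempts), where
    results[i, j] = 1 if attempt j solved question i.
    Returns sorted per-question accuracies and their empirical CDF."""
    per_question = results.mean(axis=1)        # accuracy of each question
    xs = np.sort(per_question)                 # x-axis: question-level accuracy
    cdf = np.arange(1, len(xs) + 1) / len(xs)  # y-axis: fraction of questions <= x
    return xs, cdf

# Toy usage on random data (500 questions, 10 attempts each).
rng = np.random.default_rng(0)
xs, cdf = question_level_accuracy_cdf(rng.integers(0, 2, size=(500, 10)))
```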

Results table by model

| model | pass@1 (%) | win_rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| qwen3-14b | 72.9 | 35.9 | 3 | 2 | 1.8 | 0.79 |
| google_gemma_3_12b_it | 69.5 | 33.1 | 4 | 2.1 | 1.9 | 0.8 |
| qwen2.5-coder-14b-instruct | 68.1 | 34 | 3 | 2.1 | 1.6 | 1.3 |
| qwen3-8b | 65.3 | 30.2 | 3 | 2.1 | 1.9 | 1 |
| deepseek_r1_distill_llama_70b | 64.1 | 28.9 | 2 | 2.1 | 1.7 | 1.3 |
| qwen3-4b | 64 | 29.1 | 4 | 2.1 | 1.9 | 0.9 |
| google_gemma_3_4b_it | 60.3 | 26.3 | 5 | 2.2 | 2 | 0.89 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 60.1 | 26.4 | 3 | 2.2 | 1.7 | 1.4 |
| qwen2.5-coder-7b-instruct | 57.3 | 26.7 | 4 | 2.2 | 1.5 | 1.7 |
| deepseek_r1_distill_qwen_14b | 53.3 | 22.1 | 4 | 2.2 | 1.7 | 1.4 |
| llama-3.1-8B-instruct | 52.2 | 21 | 7 | 2.2 | 2.2 | 0 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 49.1 | 19.3 | 3 | 2.2 | 1.8 | 1.3 |
| qwen2-7b-instruct | 48 | 18.7 | 4 | 2.2 | 1.7 | 1.4 |
| qwen2.5-coder-3b-instruct | 47.9 | 19.9 | 4 | 2.2 | 1.5 | 1.6 |
| qwen3-1.7b | 47.8 | 18.5 | 4 | 2.2 | 1.9 | 1.1 |
| llama-3.2-3B-instruct | 45.8 | 17.3 | 8 | 2.2 | 2.2 | 0 |
| mistralai_ministral_8b_instruct_2410 | 44.9 | 16.7 | 4 | 2.2 | 1.7 | 1.5 |
| mistralai_mathstral_7b_v0.1 | 41.1 | 14.7 | 4 | 2.2 | 1.6 | 1.5 |
| deepseek_v2_lite_chat | 40.3 | 14.6 | 3 | 2.2 | 1.7 | 1.4 |
| qwen1.5-14b-chat | 37.2 | 12.3 | 3 | 2.2 | 1.7 | 1.3 |
| mistralai_mistral_7b_instruct_v0.3 | 36.8 | 12.1 | 4 | 2.2 | 1.7 | 1.3 |
| deepseek_r1_distill_llama_8b | 34.2 | 11.8 | 4 | 2.1 | 1.5 | 1.5 |
| mistralai_mistral_7b_instruct_v0.2 | 34.1 | 11.3 | 4 | 2.1 | 1.6 | 1.4 |
| qwen2.5-coder-1.5b-instruct | 32.8 | 11.3 | 4 | 2.1 | 1.3 | 1.6 |
| deepseek_r1_distill_qwen_7b | 32.3 | 11.5 | 4 | 2.1 | 1.4 | 1.6 |
| qwen1.5-7b-chat | 31.9 | 9.96 | 3 | 2.1 | 1.6 | 1.3 |
| llama-3.2-1B-instruct | 26.2 | 8.06 | 11 | 2 | 2 | 0 |
| qwen2.5-coder-0.5b-instruct | 25 | 7.9 | 5 | 1.9 | 1.3 | 1.5 |
| qwen3-0.6b | 24.1 | 6.9 | 5 | 1.9 | 1.5 | 1.2 |
| mistralai_mistral_7b_instruct_v0.1 | 22.9 | 6.63 | 4 | 1.9 | 1.2 | 1.4 |
| qwen2-1.5b-instruct | 12.4 | 3 | 4 | 1.5 | 0.9 | 1.2 |
| deepseek_r1_distill_qwen_1.5b | 11.2 | 2.86 | 4 | 1.4 | 0.88 | 1.1 |
| qwen1.5-1.8b-chat | 7.13 | 1.58 | 3 | 1.2 | 0.7 | 0.92 |
| qwen2-0.5b-instruct | 6.72 | 1.44 | 5 | 1.1 | 0.59 | 0.95 |
| qwen1.5-0.5b-chat | 1.12 | 0.179 | 5 | 0.47 | 0.18 | 0.43 |
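Across these rows the two error components appear to combine roughly in quadrature, SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2, up to rounding; models with SE_pred(A) = 0 have SE(A) = SE_x(A). A quick check on a few rows copied from the table (the precise definitions of SE_x(A) and SE_pred(A) are those of the accompanying paper and are not reconstructed here):

```python
import math

# (model, SE(A), SE_x(A), SE_pred(A)) -- rows copied from the table above
rows = [
    ("qwen3-14b",             2.0,  1.8,  0.79),
    ("llama-3.1-8B-instruct", 2.2,  2.2,  0.0),
    ("qwen1.5-0.5b-chat",     0.47, 0.18, 0.43),
]

for name, se_total, se_x, se_pred in rows:
    combined = math.hypot(se_x, se_pred)  # sqrt(SE_x^2 + SE_pred^2)
    print(f"{name}: reported SE(A) = {se_total}, recombined = {combined:.2f}")
```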