MBPP: results by model



[Figure: SE predicted by accuracy. The typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.]
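The shape of the curve above can be reproduced with the textbook binomial standard error of a mean accuracy. This is a sketch under two assumptions not stated on this page: that SE(A) is the simple binomial standard error, and that the benchmark has roughly n = 500 questions (a value that reproduces the SE(A) column of the table below to within rounding).

```python
import math

def binomial_se(accuracy: float, n_questions: int) -> float:
    """Standard error of a mean accuracy over n independent questions."""
    return math.sqrt(accuracy * (1 - accuracy) / n_questions)

# qwen3-14b: pass@1 = 72.2%; with n = 500 (assumed) the SE in
# percentage points comes out at about 2.0, matching the table.
print(round(100 * binomial_se(0.722, 500), 1))  # → 2.0
```

The standard error peaks at 50% accuracy and shrinks toward either end, which is why the weakest models in the table report the smallest SE(A).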

[Figure: CDF of question-level accuracy.]

Results table by model

| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| qwen3-14b | 72.2 | 31.4 | 12 | 2 | 1.9 | 0.7 |
| qwen2.5-coder-14b-instruct | 72.1 | 32.9 | 12 | 2 | 1.6 | 1.2 |
| google_gemma_3_12b_it | 69.3 | 29.3 | 11 | 2.1 | 1.9 | 0.73 |
| deepseek_r1_distill_llama_70b | 66.3 | 26.7 | 11 | 2.1 | 1.8 | 1.1 |
| qwen3-8b | 66.1 | 27.1 | 12 | 2.1 | 1.9 | 0.88 |
| qwen2.5-coder-7b-instruct | 64.8 | 27.9 | 10 | 2.1 | 1.6 | 1.4 |
| qwen3-4b | 64.1 | 25.5 | 12 | 2.1 | 2 | 0.83 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 63.9 | 25.2 | 11 | 2.1 | 1.8 | 1.2 |
| google_gemma_3_4b_it | 60 | 22.6 | 13 | 2.2 | 2 | 0.81 |
| deepseek_r1_distill_qwen_14b | 57.1 | 21.1 | 11 | 2.2 | 1.8 | 1.3 |
| llama-3.1-8B-instruct | 56.2 | 19.9 | 15 | 2.2 | 2.2 | 0 |
| qwen2.5-coder-3b-instruct | 55 | 20.6 | 12 | 2.2 | 1.7 | 1.5 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 52.1 | 17.7 | 12 | 2.2 | 1.9 | 1.2 |
| mistralai_ministral_8b_instruct_2410 | 52 | 17.3 | 11 | 2.2 | 1.8 | 1.3 |
| qwen2-7b-instruct | 51.1 | 17.2 | 11 | 2.2 | 1.8 | 1.3 |
| llama-3.2-3B-instruct | 48.8 | 15.7 | 15 | 2.2 | 2.2 | 0 |
| mistralai_mathstral_7b_v0.1 | 48.7 | 15.7 | 11 | 2.2 | 1.8 | 1.4 |
| qwen3-1.7b | 48.1 | 15.7 | 12 | 2.2 | 2 | 0.99 |
| deepseek_v2_lite_chat | 44.5 | 13.6 | 11 | 2.2 | 1.8 | 1.3 |
| qwen2.5-coder-1.5b-instruct | 43.7 | 13.8 | 11 | 2.2 | 1.7 | 1.5 |
| qwen1.5-14b-chat | 40.2 | 11.4 | 12 | 2.2 | 1.8 | 1.2 |
| deepseek_r1_distill_llama_8b | 39.5 | 11.7 | 12 | 2.2 | 1.7 | 1.3 |
| mistralai_mistral_7b_instruct_v0.3 | 39.2 | 10.7 | 11 | 2.2 | 1.8 | 1.2 |
| deepseek_r1_distill_qwen_7b | 39.2 | 12 | 11 | 2.2 | 1.6 | 1.4 |
| mistralai_mistral_7b_instruct_v0.2 | 36.4 | 9.98 | 10 | 2.2 | 1.8 | 1.2 |
| qwen1.5-7b-chat | 34.8 | 9.06 | 12 | 2.1 | 1.8 | 1.2 |
| llama-3.2-1B-instruct | 32 | 8.18 | 11 | 2.1 | 2.1 | 0 |
| qwen2.5-coder-0.5b-instruct | 31.9 | 8.37 | 13 | 2.1 | 1.6 | 1.3 |
| mistralai_mistral_7b_instruct_v0.1 | 30.5 | 7.54 | 11 | 2.1 | 1.5 | 1.4 |
| qwen3-0.6b | 27.6 | 6.97 | 13 | 2 | 1.6 | 1.2 |
| qwen2-1.5b-instruct | 23.5 | 5.27 | 13 | 1.9 | 1.3 | 1.4 |
| deepseek_r1_distill_qwen_1.5b | 14.6 | 3.2 | 12 | 1.6 | 1.1 | 1.1 |
| qwen2-0.5b-instruct | 13 | 2.3 | 13 | 1.5 | 0.97 | 1.1 |
| qwen1.5-1.8b-chat | 12.6 | 2.31 | 11 | 1.5 | 1 | 1.1 |
| qwen1.5-0.5b-chat | 4.17 | 0.577 | 13 | 0.89 | 0.54 | 0.71 |
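The three SE columns appear to combine in quadrature, i.e. SE(A)² ≈ SE_x(A)² + SE_pred(A)², as in a variance decomposition into a question-level component and a predicted component. The page does not state this, so treat it as an inferred sanity check rather than documented methodology; the sketch below verifies it against a few rows copied from the table.

```python
import math

# (SE(A), SE_x(A), SE_pred(A)) triples copied from the table above
rows = {
    "qwen3-14b": (2.0, 1.9, 0.7),
    "qwen2.5-coder-14b-instruct": (2.0, 1.6, 1.2),
    "llama-3.1-8B-instruct": (2.2, 2.2, 0.0),
    "qwen1.5-0.5b-chat": (0.89, 0.54, 0.71),
}

for name, (se, se_x, se_pred) in rows.items():
    combined = math.hypot(se_x, se_pred)  # sqrt(SE_x^2 + SE_pred^2)
    print(f"{name}: reported {se}, combined {combined:.2f}")
    # Agreement to within the table's rounding supports the decomposition.
    assert abs(combined - se) < 0.1
```

Under this reading, SE_pred(A) is the part of the uncertainty that the prediction removes, which is why the two llama-3.x rows with SE_pred(A) = 0 have SE(A) equal to SE_x(A).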