leetcode: by models


SE predicted by accuracy

The typical standard errors between pairs of models on this dataset as a function of the absolute accuracy.
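
As a rough guide to why the standard error depends on accuracy, the sketch below assumes a plain binomial model in which each question is an independent pass/fail trial. That model, the function name `binomial_se`, and the value of `n_questions` are illustrative assumptions, not something this page states. The sketch also covers a single model only and ignores the correlation between two models answering the same questions.

```python
import math

def binomial_se(accuracy: float, n_questions: int) -> float:
    """Standard error of a mean accuracy under an independent pass/fail model."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)

# n_questions is a hypothetical benchmark size chosen only for illustration.
n_questions = 100
for accuracy in (0.0, 0.1, 0.5, 0.9, 1.0):
    se_points = 100 * binomial_se(accuracy, n_questions)
    print(f"accuracy={accuracy:.1f}  SE={se_points:.2f} points")
```

Under this assumption the standard error is largest near 50% accuracy and shrinks to zero at 0% or 100%, which is in line with the zero SE reported for models whose pass1 is 0.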

CDF of question-level accuracy

Results table by model

model	pass1	win_rate	count	SE(A)	SE_x(A)	SE_pred(A)
google_gemma_3_27b_it	40.1	38.5	12	3.7	3.2	1.8
google_gemma_2_27b_it	20.1	19	10	3	1.8	2.4
llama-3.1-8B-instruct	18.3	17	15	2.9	2.7	0.91
google_gemma_3_4b_it	17.5	16.4	13	2.8	2.5	1.3
google_gemma_2_9b_it	16.8	15.7	11	2.8	2	2
llama-3.2-3B-instruct	9.74	8.93	17	2.2	2.1	0.64
google_gemma_3_12b_it	6.97	6.56	11	1.9	1.5	1.2
google_codegemma_1.1_7b_it	6.28	5.76	13	1.8	1.2	1.4
llama-3.2-1B-instruct	4.03	3.64	12	1.5	1.4	0.46
google_gemma_7b_it	2.5	2.24	12	1.2	0.87	0.77
mistralai_mixtral_8x22b_instruct_v0.1	0.859	0.795	11	0.69	0.084	0.68
google_gemma_3_1b_it	0.694	0.629	12	0.62	0.25	0.57
google_gemma_2b_it	0.641	0.565	13	0.59	0.24	0.54
mistralai_mathstral_7b_v0.1	0.0505	0.0479	11	0.17	0	0.17
qwen3-32b	0.0505	0.048	11	0.17	0	0.17
qwen2.5-coder-14b-instruct	0.0463	0.044	12	0.16	0	0.16
qwen2.5-coder-3b-instruct	0.0463	0.0463	12	0.16	0	0.16
deepseek_r1_distill_llama_70b	0	0	9	0	0	0
deepseek_r1_distill_qwen_1.5b	0	0	12	0	0	0
deepseek_v2_lite_chat	0	0	11	0	0	0
deepseek_r1_distill_qwen_7b	0	0	11	0	0	0
deepseek_r1_distill_qwen_32b	0	0	9	0	0	0
deepseek_r1_distill_qwen_14b	0	0	11	0	0	0
deepseek_r1_distill_llama_8b	0	0	12	0	0	0
mistralai_mistral_7b_instruct_v0.2	0	0	10	0	0	0
mistralai_mistral_7b_instruct_v0.1	0	0	11	0	0	0
mistralai_ministral_8b_instruct_2410	0	0	11	0	0	0
qwen1.5-1.8b-chat	0	0	11	0	0	0
qwen1.5-14b-chat	0	0	12	0	0	0
qwen1.5-32b-chat	0	0	11	0	0	0
qwen1.5-72b-chat	0	0	10	0	0	0
qwen1.5-7b-chat	0	0	12	0	0	0
mistralai_mistral_7b_instruct_v0.3	0	0	11	0	0	0
mistralai_mixtral_8x7b_instruct_v0.1	0	0	12	0	0	0
qwen1.5-0.5b-chat	0	0	13	0	0	0
qwen2-72b-instruct	0	0	10	0	0	0
qwen2-1.5b-instruct	0	0	13	0	0	0
qwen2-0.5b-instruct	0	0	13	0	0	0
qwen2-7b-instruct	0	0	11	0	0	0
qwen2-math-7b-instruct	0	0	6	0	0	0
qwen2.5-coder-0.5b-instruct	0	0	13	0	0	0
qwen2-math-72b-instruct	0	0	10	0	0	0
qwen2-math-1.5b-instruct	0	0	4	0	0	0
qwen2.5-coder-32b-instruct	0	0	10	0	0	0
qwen2.5-coder-1.5b-instruct	0	0	11	0	0	0
qwen3-0.6b	0	0	13	0	0	0
qwen2.5-coder-7b-instruct	0	0	10	0	0	0
qwen3-1.7b	0	0	12	0	0	0
qwen3-14b	0	0	12	0	0	0
qwen3-4b	0	0	12	0	0	0
qwen3-8b	0	0	12	0	0	0
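
The page does not define the three SE columns. One relationship the rounded values appear to satisfy is SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2, as would hold if the total variance split into two independent components. The sketch below checks this for a few rows copied from the table. The interpretation is an assumption, and the variable names are made up for illustration.

```python
import math

# (model, SE(A), SE_x(A), SE_pred(A)) copied from the table above.
rows = [
    ("google_gemma_3_27b_it", 3.7, 3.2, 1.8),
    ("google_gemma_2_27b_it", 3.0, 1.8, 2.4),
    ("llama-3.1-8B-instruct", 2.9, 2.7, 0.91),
    ("google_gemma_3_4b_it", 2.8, 2.5, 1.3),
    ("google_gemma_2_9b_it", 2.8, 2.0, 2.0),
    ("mistralai_mixtral_8x22b_instruct_v0.1", 0.69, 0.084, 0.68),
]

gaps = []
for model, se_total, se_x, se_pred in rows:
    # Combine the two components in quadrature: sqrt(se_x^2 + se_pred^2).
    combined = math.hypot(se_x, se_pred)
    gaps.append(abs(se_total - combined))
    print(f"{model}: reported={se_total:.2f}  combined={combined:.2f}")

# Largest disagreement between the reported and the combined value.
max_gap = max(gaps)
print(f"largest gap: {max_gap:.3f}")
```

For these rows the gap stays below 0.1, which is about what rounding to two significant figures would allow. That makes the quadrature reading plausible, but it does not confirm how the columns were computed.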