Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy. All values are in percentage points; the total SE(A) decomposes in quadrature into the SE_x(A) and SE_pred(A) columns, i.e. SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2 up to rounding.
| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| qwen2.5-coder-14b-instruct | 38.6 | 29 | 3 | 1.5 | 1.2 | 1 |
| qwen3-14b | 37 | 27.3 | 3 | 1.5 | 1.3 | 0.84 |
| google_gemma_3_12b_it | 32.2 | 22.9 | 4 | 1.5 | 1.3 | 0.67 |
| qwen3-8b | 30 | 21.3 | 3 | 1.4 | 1.2 | 0.88 |
| qwen3-4b | 28.5 | 19.9 | 4 | 1.4 | 1.2 | 0.85 |
| qwen2.5-coder-7b-instruct | 28.5 | 20.4 | 4 | 1.4 | 1 | 0.99 |
| qwen2-7b-instruct | 21.2 | 14.1 | 4 | 1.3 | 0.97 | 0.85 |
| google_gemma_3_4b_it | 19.2 | 12.3 | 5 | 1.2 | 1.1 | 0.6 |
| qwen1.5-14b-chat | 18.8 | 12.3 | 3 | 1.2 | 0.94 | 0.8 |
| google_codegemma_1.1_7b_it | 18.6 | 12.2 | 5 | 1.2 | 0.9 | 0.84 |
| mistralai_ministral_8b_instruct_2410 | 17.7 | 11.4 | 4 | 1.2 | 0.81 | 0.9 |
| qwen2.5-coder-3b-instruct | 16.8 | 10.9 | 4 | 1.2 | 0.76 | 0.91 |
| mistralai_mathstral_7b_v0.1 | 16.8 | 10.7 | 4 | 1.2 | 0.81 | 0.86 |
| mistralai_mistral_7b_instruct_v0.3 | 16.5 | 10.6 | 4 | 1.2 | 0.86 | 0.8 |
| qwen3-1.7b | 14.6 | 8.76 | 4 | 1.1 | 0.89 | 0.68 |
| deepseek_r1_distill_qwen_14b | 14 | 9.24 | 4 | 1.1 | 0.69 | 0.85 |
| llama-3.1-8B-instruct | 13.9 | 8.71 | 7 | 1.1 | 1.1 | 0.085 |
| deepseek_v2_lite_chat | 13.7 | 8.4 | 3 | 1.1 | 0.77 | 0.77 |
| qwen2-math-7b-instruct | 13.4 | 8.58 | 2 | 1.1 | 0.68 | 0.83 |
| deepseek_r1_distill_qwen_7b | 11.8 | 7.34 | 4 | 1 | 0.66 | 0.78 |
| mistralai_mistral_7b_instruct_v0.2 | 11.7 | 7.17 | 4 | 1 | 0.72 | 0.71 |
| google_gemma_2_9b_it | 10.5 | 6.19 | 3 | 0.97 | 0.78 | 0.58 |
| qwen2.5-coder-1.5b-instruct | 8.55 | 5.16 | 4 | 0.88 | 0.47 | 0.75 |
| qwen1.5-7b-chat | 6.77 | 3.9 | 3 | 0.79 | 0.5 | 0.62 |
| llama-3.2-3B-instruct | 6.31 | 3.39 | 10 | 0.77 | 0.77 | 0.075 |
| qwen3-0.6b | 5.92 | 3.14 | 5 | 0.75 | 0.52 | 0.53 |
| mistralai_mistral_7b_instruct_v0.1 | 5.83 | 3.1 | 4 | 0.74 | 0.46 | 0.58 |
| deepseek_r1_distill_llama_8b | 4.83 | 2.98 | 4 | 0.68 | 0.39 | 0.55 |
| google_gemma_7b_it | 4.6 | 2.22 | 4 | 0.66 | 0.56 | 0.35 |
| google_gemma_3_1b_it | 3.6 | 1.74 | 4 | 0.59 | 0.47 | 0.36 |
| qwen2.5-coder-0.5b-instruct | 2.96 | 1.49 | 5 | 0.54 | 0.28 | 0.45 |
| qwen2-1.5b-instruct | 2.93 | 1.51 | 4 | 0.53 | 0.24 | 0.48 |
| llama-3.2-1B-instruct | 1.49 | 0.786 | 13 | 0.38 | 0.38 | 0.028 |
| deepseek_r1_distill_qwen_1.5b | 1.23 | 0.73 | 4 | 0.35 | 0.2 | 0.29 |
| qwen2-math-1.5b-instruct | 1.13 | 0.606 | 3 | 0.33 | 0.16 | 0.29 |
| qwen2-0.5b-instruct | 1.12 | 0.447 | 5 | 0.33 | 0.17 | 0.28 |
| qwen1.5-1.8b-chat | 0.533 | 0.258 | 3 | 0.23 | 0.099 | 0.21 |
| google_gemma_2b_it | 0.45 | 0.197 | 4 | 0.21 | 0.038 | 0.21 |
| qwen1.5-0.5b-chat | 0.06 | 0.0376 | 5 | 0.077 | 0 | 0.077 |
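
Below is a minimal sketch of how a paired standard error between two models can be computed, assuming per-question binary scores for both models on the same questions. The arrays `scores_a` and `scores_b` are hypothetical placeholders, not the source data; the snippet also checks the quadrature relation above against the first row of the table.

```python
import numpy as np

def paired_se(scores_a: np.ndarray, scores_b: np.ndarray) -> float:
    """Standard error of the accuracy difference A - B, computed from
    per-question paired differences. Pairing on questions accounts for
    the correlation between the two models' scores."""
    d = scores_a - scores_b              # per-question score difference
    n = len(d)
    return d.std(ddof=1) / np.sqrt(n)    # SE of the mean difference

# Hypothetical example: 500 questions, binary pass/fail per model.
rng = np.random.default_rng(0)
scores_a = rng.integers(0, 2, size=500).astype(float)
scores_b = rng.integers(0, 2, size=500).astype(float)
print(paired_se(scores_a, scores_b))

# Sanity check of the table's decomposition on the first row
# (qwen2.5-coder-14b-instruct): SE(A)^2 ~= SE_x(A)^2 + SE_pred(A)^2.
se_x, se_pred = 1.2, 1.0
print(np.hypot(se_x, se_pred))           # ~1.56, reported as 1.5
```

The paired formulation is one common choice here: differencing per question before averaging removes the shared question-difficulty variance, which is why pairwise comparisons can be tighter than the per-model SE(A) columns alone would suggest.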