ds1000: by model



Figure: SE predicted by accuracy — the typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
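The rough shape of the curve above can be sketched with a simple binomial-noise model. Under the assumption that each of the benchmark's questions is an independent Bernoulli trial (DS-1000 has 1000 problems), the standard error of an accuracy estimate follows from the binomial variance; the page's SE_x and SE_pred columns presumably come from a more detailed model, so this is an illustrative approximation, not the site's actual computation.

```python
import math

def binomial_se(accuracy_pct, n_questions=1000):
    """Standard error (in percentage points) of an accuracy estimate,
    treating each question as an independent Bernoulli trial.
    n_questions=1000 matches the DS-1000 benchmark size (assumption:
    all questions are weighted equally and scored independently)."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)
```

For a model at 38.6% accuracy this gives an SE of about 1.5 percentage points, consistent with the SE(A) column in the table below; for the difference between two independent models, the pairwise SE would be the root sum of squares of their individual SEs.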

Figure: CDF of question-level accuracy.

Results table by model

Pass@1, pass@count, and win_rate are percentages; the SE columns are in percentage points.

| model | pass@1 | pass@count | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|---|
| qwen2.5-coder-14b-instruct | 38.6 | 54 | 29 | 3 | 1.5 | 1.2 | 1 |
| qwen3-14b | 37 | 47.9 | 27.3 | 3 | 1.5 | 1.3 | 0.84 |
| google_gemma_3_12b_it | 32.2 | 40.3 | 22.9 | 4 | 1.5 | 1.3 | 0.67 |
| qwen3-8b | 30 | 41.5 | 21.3 | 3 | 1.4 | 1.2 | 0.88 |
| qwen3-4b | 28.5 | 42.6 | 19.9 | 4 | 1.4 | 1.2 | 0.85 |
| qwen2.5-coder-7b-instruct | 28.5 | 46.2 | 20.4 | 4 | 1.4 | 1 | 0.99 |
| qwen2-7b-instruct | 21.2 | 35 | 14.1 | 4 | 1.3 | 0.97 | 0.85 |
| google_gemma_3_4b_it | 19.2 | 27.7 | 12.3 | 5 | 1.2 | 1.1 | 0.6 |
| qwen1.5-14b-chat | 18.8 | 28.9 | 12.3 | 3 | 1.2 | 0.94 | 0.8 |
| google_codegemma_1.1_7b_it | 18.6 | 35.6 | 12.2 | 5 | 1.2 | 0.9 | 0.84 |
| mistralai_ministral_8b_instruct_2410 | 17.7 | 34.7 | 11.4 | 4 | 1.2 | 0.81 | 0.9 |
| qwen2.5-coder-3b-instruct | 16.8 | 33.2 | 10.9 | 4 | 1.2 | 0.76 | 0.91 |
| mistralai_mathstral_7b_v0.1 | 16.8 | 32.2 | 10.7 | 4 | 1.2 | 0.81 | 0.86 |
| mistralai_mistral_7b_instruct_v0.3 | 16.5 | 29.3 | 10.6 | 4 | 1.2 | 0.86 | 0.8 |
| qwen3-1.7b | 14.6 | 24.1 | 8.76 | 4 | 1.1 | 0.89 | 0.68 |
| deepseek_r1_distill_qwen_14b | 14 | 29.8 | 9.24 | 4 | 1.1 | 0.69 | 0.85 |
| llama-3.1-8B-instruct | 13.9 | 14 | 8.71 | 7 | 1.1 | 1.1 | 0.085 |
| deepseek_v2_lite_chat | 13.7 | 23.1 | 8.4 | 3 | 1.1 | 0.77 | 0.77 |
| qwen2-math-7b-instruct | 13.4 | 20.3 | 8.58 | 2 | 1.1 | 0.68 | 0.83 |
| deepseek_r1_distill_qwen_7b | 11.8 | 24.8 | 7.34 | 4 | 1 | 0.66 | 0.78 |
| mistralai_mistral_7b_instruct_v0.2 | 11.7 | 22.5 | 7.17 | 4 | 1 | 0.72 | 0.71 |
| google_gemma_2_9b_it | 10.5 | 16.4 | 6.19 | 3 | 0.97 | 0.78 | 0.58 |
| qwen2.5-coder-1.5b-instruct | 8.55 | 21.1 | 5.16 | 4 | 0.88 | 0.47 | 0.75 |
| qwen1.5-7b-chat | 6.77 | 13.3 | 3.9 | 3 | 0.79 | 0.5 | 0.62 |
| llama-3.2-3B-instruct | 6.31 | 6.4 | 3.39 | 10 | 0.77 | 0.77 | 0.075 |
| qwen3-0.6b | 5.92 | 13.5 | 3.14 | 5 | 0.75 | 0.52 | 0.53 |
| mistralai_mistral_7b_instruct_v0.1 | 5.83 | 13.4 | 3.1 | 4 | 0.74 | 0.46 | 0.58 |
| deepseek_r1_distill_llama_8b | 4.83 | 12 | 2.98 | 4 | 0.68 | 0.39 | 0.55 |
| google_gemma_7b_it | 4.6 | 7.2 | 2.22 | 4 | 0.66 | 0.56 | 0.35 |
| google_gemma_3_1b_it | 3.6 | 6.4 | 1.74 | 4 | 0.59 | 0.47 | 0.36 |
| qwen2.5-coder-0.5b-instruct | 2.96 | 8.9 | 1.49 | 5 | 0.54 | 0.28 | 0.45 |
| qwen2-1.5b-instruct | 2.93 | 8.6 | 1.51 | 4 | 0.53 | 0.24 | 0.48 |
| llama-3.2-1B-instruct | 1.49 | 1.5 | 0.786 | 13 | 0.38 | 0.38 | 0.028 |
| deepseek_r1_distill_qwen_1.5b | 1.23 | 3.2 | 0.73 | 4 | 0.35 | 0.2 | 0.29 |
| qwen2-math-1.5b-instruct | 1.13 | 2.7 | 0.606 | 3 | 0.33 | 0.16 | 0.29 |
| qwen2-0.5b-instruct | 1.12 | 3.6 | 0.447 | 5 | 0.33 | 0.17 | 0.28 |
| qwen1.5-1.8b-chat | 0.533 | 1.3 | 0.258 | 3 | 0.23 | 0.099 | 0.21 |
| google_gemma_2b_it | 0.45 | 1.7 | 0.197 | 4 | 0.21 | 0.038 | 0.21 |
| qwen1.5-0.5b-chat | 0.06 | 0.3 | 0.0376 | 5 | 0.077 | 0 | 0.077 |
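The pass@count column reports the pass rate when all of a model's samples per question are pooled. A common way to estimate pass@k from n samples with c successes is the unbiased estimator of Chen et al. (2021); whether this page uses that exact estimator is an assumption, so the sketch below is illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn without replacement from n generations falls among
    the c correct ones, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer incorrect samples than k draws: success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 4 samples and c = 1 correct, pass@1 is 0.25 while pass@4 is 1.0, which is why pass@count always upper-bounds pass@1 in the table.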