DS-1000: results by model



SE predicted by accuracy

Typical standard errors between pairs of models on this dataset, plotted as a function of absolute accuracy.
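A standard error predicted from accuracy alone can be sketched with the binomial formula, assuming independent questions and n = 1000 (the size of DS-1000); the exact predictor used for the plot is not specified here, but the SE(A) column below is consistent with this formula:

```python
import math

def predicted_se(accuracy_pct, n_questions=1000):
    """Binomial standard error of an accuracy, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

# e.g. a model with 43.2% pass@1 over 1000 questions:
se = predicted_se(43.2)  # about 1.57 percentage points
```

This matches the table: a 43.2% model shows SE(A) = 1.6, and a 9.23% model shows SE(A) = 0.92.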

CDF of question-level accuracy
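The empirical CDF behind such a plot can be sketched as follows, assuming per-question pass rates are available as a list (the data behind the actual figure is not included here; the values in the example are hypothetical):

```python
import numpy as np

def empirical_cdf(per_question_accuracy):
    """Return (x, y) points of the empirical CDF of per-question pass rates."""
    x = np.sort(np.asarray(per_question_accuracy, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)  # fraction of questions at or below each rate
    return x, y

# hypothetical per-question pass rates for one model
x, y = empirical_cdf([0.0, 0.25, 0.5, 0.5, 1.0])
```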

Results table by model

| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| qwen2.5-coder-14b-instruct | 43.2 | 30.9 | 12 | 1.6 | 1.3 | 0.88 |
| qwen3-14b | 38.0 | 26.2 | 12 | 1.5 | 1.3 | 0.77 |
| qwen2.5-coder-7b-instruct | 35.7 | 24.5 | 10 | 1.5 | 1.2 | 0.92 |
| google_gemma_3_12b_it | 32.8 | 21.5 | 11 | 1.5 | 1.4 | 0.57 |
| qwen3-8b | 31.2 | 20.3 | 12 | 1.5 | 1.2 | 0.79 |
| qwen3-4b | 28.9 | 18.5 | 12 | 1.4 | 1.2 | 0.79 |
| mistralai_ministral_8b_instruct_2410 | 24.7 | 15.3 | 11 | 1.4 | 1.0 | 0.90 |
| qwen2.5-coder-3b-instruct | 23.8 | 14.6 | 12 | 1.3 | 1.0 | 0.89 |
| qwen2-7b-instruct | 23.6 | 14.4 | 11 | 1.3 | 1.1 | 0.79 |
| mistralai_mathstral_7b_v0.1 | 22.6 | 13.7 | 11 | 1.3 | 0.97 | 0.89 |
| llama-3.1-8B-instruct | 22.3 | 13.6 | 15 | 1.3 | 1.3 | 0.15 |
| google_codegemma_1.1_7b_it | 20.9 | 12.7 | 13 | 1.3 | 0.97 | 0.84 |
| qwen1.5-14b-chat | 20.5 | 12.3 | 12 | 1.3 | 1.0 | 0.77 |
| qwen2-math-7b-instruct | 20.2 | 12.2 | 6 | 1.3 | 1.0 | 0.79 |
| google_gemma_3_4b_it | 19.5 | 11.2 | 13 | 1.3 | 1.1 | 0.56 |
| mistralai_mistral_7b_instruct_v0.3 | 18.2 | 10.5 | 11 | 1.2 | 0.98 | 0.73 |
| deepseek_v2_lite_chat | 17.4 | 10.2 | 11 | 1.2 | 0.88 | 0.81 |
| deepseek_r1_distill_qwen_14b | 16.1 | 9.82 | 11 | 1.2 | 0.82 | 0.83 |
| qwen3-1.7b | 15.2 | 8.08 | 12 | 1.1 | 0.97 | 0.58 |
| deepseek_r1_distill_qwen_7b | 14.8 | 8.42 | 11 | 1.1 | 0.81 | 0.78 |
| qwen2.5-coder-1.5b-instruct | 14.1 | 8.14 | 11 | 1.1 | 0.70 | 0.85 |
| mistralai_mistral_7b_instruct_v0.2 | 13.6 | 7.55 | 10 | 1.1 | 0.85 | 0.67 |
| google_gemma_2_9b_it | 10.6 | 5.60 | 12 | 0.97 | 0.83 | 0.51 |
| llama-3.2-3B-instruct | 10.1 | 5.18 | 17 | 0.95 | 0.95 | 0.072 |
| mistralai_mistral_7b_instruct_v0.1 | 9.23 | 4.58 | 11 | 0.92 | 0.65 | 0.65 |
| qwen1.5-7b-chat | 8.67 | 4.54 | 12 | 0.89 | 0.62 | 0.64 |
| qwen3-0.6b | 7.42 | 3.48 | 13 | 0.83 | 0.63 | 0.54 |
| deepseek_r1_distill_llama_8b | 6.85 | 3.71 | 13 | 0.80 | 0.54 | 0.59 |
| qwen2-1.5b-instruct | 5.78 | 2.68 | 13 | 0.74 | 0.47 | 0.57 |
| google_gemma_7b_it | 4.75 | 1.94 | 13 | 0.67 | 0.60 | 0.30 |
| llama-3.2-1B-instruct | 4.58 | 2.16 | 12 | 0.66 | 0.66 | 0.039 |
| qwen2.5-coder-0.5b-instruct | 4.55 | 2.07 | 13 | 0.66 | 0.38 | 0.53 |
| qwen2-math-1.5b-instruct | 4.08 | 2.21 | 4 | 0.63 | 0.34 | 0.52 |
| google_gemma_3_1b_it | 3.83 | 1.65 | 13 | 0.61 | 0.50 | 0.34 |
| qwen2-0.5b-instruct | 2.20 | 0.906 | 13 | 0.46 | 0.27 | 0.37 |
| deepseek_r1_distill_qwen_1.5b | 1.78 | 0.971 | 13 | 0.42 | 0.22 | 0.35 |
| qwen1.5-1.8b-chat | 1.37 | 0.60 | 12 | 0.37 | 0.17 | 0.32 |
| google_gemma_2b_it | 0.385 | 0.17 | 13 | 0.20 | 0.096 | 0.17 |
| qwen1.5-0.5b-chat | 0.354 | 0.13 | 13 | 0.19 | 0.071 | 0.17 |