math_cot: by models


SE predicted by accuracy

Typical standard errors between pairs of models on this dataset, shown as a function of absolute accuracy.
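
As a rough sanity check on how a standard error can depend on accuracy, the sketch below computes the binomial standard error of a single model's accuracy, sqrt(p(1-p)/N), in percentage points. The question count N = 5000 is an assumption: it is not stated on this page and was chosen because it approximately reproduces the SE(A) column of the results table. The helper name `binomial_se_pct` is hypothetical. The sketch covers one model at a time, not the paired comparison the figure describes.

```python
import math

N_QUESTIONS = 5000  # assumption: not stated on this page


def binomial_se_pct(pass1_pct, n_questions=N_QUESTIONS):
    """Binomial standard error of an accuracy, in percentage points."""
    p = pass1_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


# (model, pass1, SE(A)) copied from the results table
rows = [
    ("google_gemma_3_27b_it", 86.5, 0.48),
    ("llama-3.1-70B-instruct", 58.5, 0.70),
    ("qwen1.5-0.5b-chat", 0.435, 0.093),
]

for model, pass1, se_reported in rows:
    se_computed = binomial_se_pct(pass1)
    print(f"{model}: computed {se_computed:.3f}, reported {se_reported}")
```

Under this assumption the standard error peaks near 50% accuracy and shrinks toward both extremes, which matches the way SE(A) in the table plateaus at about 0.7 for mid-range models.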

CDF of question-level accuracy

Results table by model

model	pass1	win_rate	count	SE(A)	SE_x(A)	SE_pred(A)
google_gemma_3_27b_it	86.5	50.3	2	0.48	0.4	0.27
qwen3-14b	82.4	47	2	0.54	0.43	0.33
google_gemma_3_12b_it	80.2	45.2	3	0.56	0.46	0.33
qwen3-4b	78.1	43.6	3	0.58	0.47	0.35
qwen3-8b	77.8	43.4	2	0.59	0.45	0.38
qwen3-32b	75.2	42.2	2	0.61	0.39	0.47
google_gemma_3_4b_it	71.5	38.7	4	0.64	0.53	0.36
deepseek_r1_distill_llama_70b	70.9	40.7	1	0.64	NaN	NaN
deepseek_r1_distill_qwen_7b	68.4	39.1	3	0.66	0.35	0.56
deepseek_r1_distill_llama_8b	65.6	37	3	0.67	0.37	0.56
qwen3-1.7b	63	32.8	3	0.68	0.52	0.44
deepseek_r1_distill_qwen_14b	59.9	34.1	3	0.69	0.38	0.58
deepseek_r1_distill_qwen_1.5b	59	32	3	0.7	0.38	0.58
llama-3.1-70B-instruct	58.5	29.8	3	0.7	0.69	0.077
deepseek_r1_distill_qwen_32b	57.8	32.6	1	0.7	NaN	NaN
qwen2.5-coder-32b-instruct	54.4	27.8	1	0.7	NaN	NaN
qwen2-72b-instruct	54.1	26.7	1	0.7	NaN	NaN
google_gemma_2_9b_it	44	20	3	0.7	0.57	0.41
qwen1.5-72b-chat	41.9	19.2	1	0.7	NaN	NaN
qwen2.5-coder-14b-instruct	41.4	19.4	2	0.7	0.45	0.53
qwen1.5-32b-chat	39.6	17.8	2	0.69	0.5	0.47
qwen2.5-coder-7b-instruct	35.3	16	3	0.68	0.41	0.54
mistralai_mixtral_8x22b_instruct_v0.1	33.1	14.4	2	0.67	0.45	0.49
qwen2-7b-instruct	32.8	14.8	3	0.66	0.42	0.51
mistralai_mathstral_7b_v0.1	32.7	14.2	3	0.66	0.44	0.5
qwen3-0.6b	32.4	14	4	0.66	0.47	0.46
google_gemma_3_1b_it	32.1	13.9	3	0.66	0.51	0.42
llama-3.2-3B-instruct	30.9	13.2	10	0.65	0.65	0.055
qwen2.5-coder-3b-instruct	29.9	13.1	3	0.65	0.39	0.52
llama-3.1-8B-instruct	28.7	12.2	7	0.64	0.64	0.057
qwen1.5-14b-chat	28.5	11.9	2	0.64	0.45	0.46
mistralai_ministral_8b_instruct_2410	28.5	12	3	0.64	0.41	0.49
mistralai_mixtral_8x7b_instruct_v0.1	22.7	9.09	2	0.59	0.39	0.45
deepseek_v2_lite_chat	22	8.79	2	0.59	0.38	0.45
qwen2.5-coder-1.5b-instruct	19.9	8.05	3	0.56	0.32	0.47
google_codegemma_1.1_7b_it	18.1	7.03	4	0.55	0.36	0.41
qwen1.5-7b-chat	17.6	6.8	3	0.54	0.32	0.43
llama-3.2-1B-instruct	12.4	4.6	12	0.47	0.46	0.042
google_gemma_7b_it	11.3	4.25	3	0.45	0.32	0.32
mistralai_mistral_7b_instruct_v0.3	11.1	4.05	3	0.44	0.26	0.36
mistralai_mistral_7b_instruct_v0.2	9.4	3.36	3	0.41	0.26	0.32
qwen2-1.5b-instruct	8.98	3.4	3	0.4	0.19	0.35
google_gemma_2b_it	6.05	2.29	3	0.34	0.21	0.26
qwen2.5-coder-0.5b-instruct	5.22	1.95	4	0.31	0.14	0.28
mistralai_mistral_7b_instruct_v0.1	4.87	1.73	3	0.3	0.15	0.27
qwen2-0.5b-instruct	3.92	1.39	4	0.27	0.11	0.25
qwen1.5-1.8b-chat	3.29	1.15	3	0.25	0.11	0.23
qwen1.5-0.5b-chat	0.435	0.145	4	0.093	0.024	0.09
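
The three SE columns appear to satisfy SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2, as if the total variance were split into two components combined in quadrature. This reading is inferred from the numbers and is not stated on this page. The sketch below checks it on a few rows copied from the table; the function name `combined_se` is hypothetical.

```python
import math


def combined_se(se_x, se_pred):
    """Combine two standard-error components in quadrature."""
    return math.hypot(se_x, se_pred)


# (model, SE(A), SE_x(A), SE_pred(A)) copied from the results table
rows = [
    ("google_gemma_3_27b_it", 0.48, 0.40, 0.27),
    ("qwen3-32b", 0.61, 0.39, 0.47),
    ("deepseek_r1_distill_qwen_7b", 0.66, 0.35, 0.56),
    ("llama-3.1-70B-instruct", 0.70, 0.69, 0.077),
]

for model, se_total, se_x, se_pred in rows:
    recombined = combined_se(se_x, se_pred)
    # The table values are rounded, so only approximate agreement is expected.
    print(f"{model}: reported {se_total}, recombined {recombined:.3f}")
```

Rows with a count of 1 report NaN for both SE_x(A) and SE_pred(A), so this check applies only to models with more than one sample.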