Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy. pass@1 and win rate are given in percent; the standard-error columns are in percentage points.
| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| google_gemma_3_27b_it | 86.1 | 46.6 | 8 | 0.49 | 0.41 | 0.26 |
| qwen3-14b | 82.8 | 44.0 | 10 | 0.53 | 0.43 | 0.32 |
| google_gemma_3_12b_it | 80.1 | 41.9 | 11 | 0.56 | 0.46 | 0.32 |
| qwen3-4b | 78.6 | 40.7 | 12 | 0.58 | 0.47 | 0.34 |
| qwen3-8b | 78.2 | 40.5 | 10 | 0.58 | 0.46 | 0.36 |
| qwen3-32b | 76.9 | 40.2 | 10 | 0.60 | 0.41 | 0.43 |
| deepseek_r1_distill_llama_70b | 73.6 | 39.4 | 9 | 0.62 | 0.46 | 0.43 |
| google_gemma_3_4b_it | 72.0 | 36.0 | 13 | 0.63 | 0.53 | 0.35 |
| deepseek_r1_distill_qwen_7b | 71.8 | 38.8 | 12 | 0.64 | 0.37 | 0.52 |
| deepseek_r1_distill_llama_8b | 69.1 | 36.2 | 12 | 0.65 | 0.40 | 0.52 |
| qwen3-1.7b | 63.9 | 30.5 | 12 | 0.68 | 0.53 | 0.42 |
| llama-3.1-70B-instruct | 62.9 | 29.8 | 12 | 0.68 | 0.68 | 0.060 |
| deepseek_r1_distill_qwen_1.5b | 62.6 | 31.6 | 12 | 0.68 | 0.39 | 0.56 |
| deepseek_r1_distill_qwen_14b | 61.3 | 32.7 | 11 | 0.69 | 0.41 | 0.55 |
| qwen2-72b-instruct | 59.3 | 27.4 | 10 | 0.69 | 0.51 | 0.47 |
| deepseek_r1_distill_qwen_32b | 58.0 | 30.6 | 9 | 0.70 | 0.44 | 0.54 |
| qwen2.5-coder-32b-instruct | 55.1 | 26.1 | 10 | 0.70 | 0.50 | 0.50 |
| google_gemma_2_27b_it | 51.9 | 22.5 | 7 | 0.71 | 0.58 | 0.40 |
| qwen2.5-coder-14b-instruct | 49.4 | 21.7 | 9 | 0.71 | 0.48 | 0.52 |
| qwen2.5-coder-7b-instruct | 45.9 | 20.1 | 11 | 0.70 | 0.45 | 0.55 |
| google_gemma_2_9b_it | 44.9 | 18.2 | 10 | 0.70 | 0.59 | 0.39 |
| qwen1.5-72b-chat | 43.5 | 17.8 | 10 | 0.70 | 0.53 | 0.46 |
| qwen1.5-32b-chat | 42.2 | 17.1 | 10 | 0.70 | 0.52 | 0.46 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 41.8 | 17.1 | 10 | 0.70 | 0.50 | 0.49 |
| mistralai_mathstral_7b_v0.1 | 39.7 | 15.9 | 11 | 0.69 | 0.49 | 0.49 |
| mistralai_ministral_8b_instruct_2410 | 38.3 | 15.2 | 11 | 0.69 | 0.48 | 0.49 |
| llama-3.2-3B-instruct | 37.8 | 15.1 | 19 | 0.69 | 0.68 | 0.064 |
| llama-3.1-8B-instruct | 37.5 | 15.0 | 16 | 0.68 | 0.68 | 0.064 |
| qwen3-0.6b | 37.0 | 14.8 | 13 | 0.68 | 0.50 | 0.46 |
| qwen2.5-coder-3b-instruct | 36.7 | 14.9 | 12 | 0.68 | 0.44 | 0.52 |
| qwen2-7b-instruct | 36.2 | 15.0 | 11 | 0.68 | 0.44 | 0.51 |
| google_gemma_3_1b_it | 33.3 | 13.0 | 12 | 0.67 | 0.53 | 0.41 |
| qwen1.5-14b-chat | 31.5 | 11.8 | 10 | 0.66 | 0.47 | 0.46 |
| qwen2.5-coder-1.5b-instruct | 27.2 | 10.1 | 12 | 0.63 | 0.40 | 0.48 |
| deepseek_v2_lite_chat | 25.7 | 9.10 | 10 | 0.62 | 0.44 | 0.44 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 25.6 | 9.16 | 10 | 0.62 | 0.43 | 0.45 |
| google_codegemma_1.1_7b_it | 20.5 | 7.12 | 13 | 0.57 | 0.40 | 0.41 |
| qwen1.5-7b-chat | 20.5 | 7.02 | 10 | 0.57 | 0.38 | 0.43 |
| llama-3.2-1B-instruct | 18.7 | 6.38 | 21 | 0.55 | 0.55 | 0.051 |
| qwen2-1.5b-instruct | 15.1 | 5.06 | 12 | 0.51 | 0.30 | 0.41 |
| mistralai_mistral_7b_instruct_v0.3 | 13.2 | 4.26 | 11 | 0.48 | 0.30 | 0.37 |
| google_gemma_7b_it | 11.9 | 3.91 | 12 | 0.46 | 0.34 | 0.31 |
| mistralai_mistral_7b_instruct_v0.2 | 10.3 | 3.20 | 11 | 0.43 | 0.28 | 0.33 |
| qwen2-0.5b-instruct | 7.62 | 2.41 | 13 | 0.38 | 0.20 | 0.32 |
| qwen2.5-coder-0.5b-instruct | 7.22 | 2.29 | 13 | 0.37 | 0.19 | 0.32 |
| mistralai_mistral_7b_instruct_v0.1 | 7.06 | 2.21 | 11 | 0.36 | 0.21 | 0.30 |
| google_gemma_2b_it | 6.33 | 2.11 | 12 | 0.34 | 0.23 | 0.25 |
| qwen1.5-1.8b-chat | 5.22 | 1.67 | 10 | 0.31 | 0.15 | 0.28 |
| qwen1.5-0.5b-chat | 1.30 | 0.486 | 13 | 0.16 | 0.049 | 0.15 |
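
Two regularities in the table can be checked directly from the numbers. First, the two components appear to combine in quadrature, SE(A)² ≈ SE_x(A)² + SE_pred(A)². Second, SE(A) is numerically consistent with the plain binomial standard error of an accuracy A over roughly n ≈ 5000 questions; that dataset size is back-solved from the table, not stated anywhere in it, so treat it as an assumption. A minimal Python sketch of both checks:

```python
import math

def binomial_se(pass1_pct: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy given in percent,
    returned in percentage points: 100 * sqrt(p * (1 - p) / n)."""
    p = pass1_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

def combined_se(se_x: float, se_pred: float) -> float:
    """Recombine the two components, assuming they add in quadrature:
    SE(A)^2 ~= SE_x(A)^2 + SE_pred(A)^2."""
    return math.hypot(se_x, se_pred)

# Spot-check a few rows: (model, pass@1, SE(A), SE_x(A), SE_pred(A)).
rows = [
    ("google_gemma_3_27b_it", 86.1, 0.49, 0.41, 0.26),
    ("qwen2.5-coder-32b-instruct", 55.1, 0.70, 0.50, 0.50),
    ("qwen1.5-0.5b-chat", 1.30, 0.16, 0.049, 0.15),
]
N = 5000  # assumed dataset size, inferred from the table
for name, pass1, se, se_x, se_pred in rows:
    print(f"{name}: SE(A)={se:.2f}, "
          f"binomial={binomial_se(pass1, N):.2f}, "
          f"recombined={combined_se(se_x, se_pred):.2f}")
```

All three printed values agree with the tabulated SE(A) to within rounding, which is what motivates the n ≈ 5000 guess and the quadrature reading of SE_x(A) and SE_pred(A).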