The table below shows the typical standard errors between pairs of models on this dataset as a function of absolute accuracy.
| Model | pass@1 (%) | Win rate (%) | Count | SE(A) (%) | SE_x(A) (%) | SE_pred(A) (%) |
|---|---|---|---|---|---|---|
| google_gemma_3_27b_it | 94.4 | 32.6 | 3 | 0.63 | 0.56 | 0.29 |
| qwen3-14b | 93.5 | 32.2 | 3 | 0.68 | 0.59 | 0.34 |
| qwen3-32b | 93.4 | 32.2 | 2 | 0.69 | 0.53 | 0.43 |
| llama-3.1-70B-instruct | 92.9 | 31.9 | 4 | 0.71 | 0.71 | 0.00 |
| google_gemma_3_12b_it | 92.3 | 31.4 | 4 | 0.73 | 0.61 | 0.40 |
| qwen3-8b | 92.0 | 31.0 | 3 | 0.75 | 0.63 | 0.40 |
| qwen2.5-coder-32b-instruct | 91.3 | 30.8 | 2 | 0.78 | 0.56 | 0.54 |
| deepseek_r1_distill_llama_70b | 90.6 | 30.9 | 2 | 0.80 | 0.57 | 0.56 |
| google_gemma_2_27b_it | 89.8 | 29.6 | 2 | 0.83 | 0.68 | 0.48 |
| qwen2-72b-instruct | 89.5 | 30.1 | 2 | 0.84 | 0.56 | 0.63 |
| qwen3-4b | 89.5 | 29.6 | 4 | 0.85 | 0.76 | 0.38 |
| google_gemma_2_9b_it | 86.8 | 28.0 | 3 | 0.93 | 0.74 | 0.57 |
| qwen2-math-72b-instruct | 84.4 | 27.7 | 2 | 1.0 | 0.52 | 0.85 |
| google_gemma_3_4b_it | 83.3 | 26.3 | 5 | 1.0 | 0.86 | 0.57 |
| qwen1.5-72b-chat | 82.7 | 26.4 | 2 | 1.0 | 0.72 | 0.75 |
| deepseek_r1_distill_qwen_7b | 81.9 | 26.3 | 4 | 1.1 | 0.67 | 0.82 |
| qwen1.5-32b-chat | 81.7 | 26.0 | 2 | 1.1 | 0.70 | 0.80 |
| deepseek_r1_distill_qwen_14b | 81.3 | 26.4 | 4 | 1.1 | 0.72 | 0.80 |
| qwen2.5-coder-14b-instruct | 81.0 | 26.1 | 3 | 1.1 | 0.58 | 0.91 |
| deepseek_r1_distill_qwen_32b | 80.3 | 26.8 | 2 | 1.1 | 0.61 | 0.91 |
| qwen2-math-7b-instruct | 79.8 | 24.8 | 4 | 1.1 | 0.74 | 0.82 |
| llama-3.1-8B-instruct | 78.3 | 23.8 | 7 | 1.1 | 1.1 | 0.00 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 78.3 | 24.4 | 2 | 1.1 | 0.67 | 0.91 |
| qwen2-math-1.5b-instruct | 75.2 | 22.7 | 4 | 1.2 | 0.81 | 0.87 |
| mistralai_ministral_8b_instruct_2410 | 73.9 | 22.3 | 4 | 1.2 | 0.74 | 0.96 |
| qwen3-1.7b | 73.4 | 21.6 | 4 | 1.2 | 0.99 | 0.71 |
| deepseek_r1_distill_llama_8b | 73.0 | 22.5 | 4 | 1.2 | 0.71 | 1.0 |
| qwen2-7b-instruct | 71.7 | 21.6 | 4 | 1.2 | 0.73 | 1.0 |
| qwen1.5-14b-chat | 70.4 | 20.3 | 3 | 1.3 | 0.87 | 0.90 |
| llama-3.2-3B-instruct | 67.6 | 19.4 | 10 | 1.3 | 1.3 | 0.00 |
| qwen2.5-coder-7b-instruct | 65.4 | 19.2 | 4 | 1.3 | 0.74 | 1.1 |
| mistralai_mathstral_7b_v0.1 | 63.4 | 18.2 | 4 | 1.3 | 0.73 | 1.1 |
| deepseek_v2_lite_chat | 62.1 | 17.1 | 2 | 1.3 | 0.89 | 1.0 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 60.8 | 17.1 | 3 | 1.3 | 0.84 | 1.0 |
| deepseek_r1_distill_qwen_1.5b | 59.8 | 17.3 | 4 | 1.4 | 0.72 | 1.1 |
| qwen1.5-7b-chat | 55.2 | 14.5 | 3 | 1.4 | 0.91 | 1.0 |
| qwen2.5-coder-3b-instruct | 54.9 | 14.6 | 4 | 1.4 | 0.81 | 1.1 |
| google_codegemma_1.1_7b_it | 50.3 | 13.0 | 5 | 1.4 | 0.89 | 1.1 |
| google_gemma_3_1b_it | 44.6 | 10.9 | 4 | 1.4 | 1.1 | 0.86 |
| mistralai_mistral_7b_instruct_v0.3 | 43.4 | 11.0 | 4 | 1.4 | 0.82 | 1.1 |
| qwen3-0.6b | 38.2 | 8.87 | 5 | 1.3 | 0.92 | 0.98 |
| qwen2.5-coder-1.5b-instruct | 37.4 | 8.72 | 4 | 1.3 | 0.82 | 1.0 |
| mistralai_mistral_7b_instruct_v0.2 | 36.8 | 8.82 | 4 | 1.3 | 0.86 | 1.0 |
| llama-3.2-1B-instruct | 29.0 | 6.29 | 13 | 1.2 | 1.2 | 0.00 |
| google_gemma_7b_it | 27.5 | 6.17 | 4 | 1.2 | 0.85 | 0.89 |
| mistralai_mistral_7b_instruct_v0.1 | 22.5 | 4.79 | 4 | 1.1 | 0.63 | 0.96 |
| qwen2-1.5b-instruct | 20.5 | 4.43 | 4 | 1.1 | 0.53 | 0.98 |
| qwen1.5-1.8b-chat | 11.1 | 2.12 | 3 | 0.86 | 0.41 | 0.76 |
| qwen2.5-coder-0.5b-instruct | 9.83 | 1.87 | 5 | 0.82 | 0.38 | 0.73 |
| google_gemma_2b_it | 9.17 | 1.77 | 4 | 0.79 | 0.49 | 0.62 |
| qwen2-0.5b-instruct | 8.49 | 1.58 | 5 | 0.77 | 0.33 | 0.69 |
| qwen1.5-0.5b-chat | 3.12 | 0.622 | 5 | 0.48 | 0.14 | 0.46 |
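Across the rows, the total standard error appears to combine the two components in quadrature, i.e. SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2 up to the table's two-significant-figure rounding (and the rows with SE_pred(A) = 0.00 have SE(A) = SE_x(A) exactly). The minimal Python sketch below checks this on a handful of rows; the quadrature relationship is an inference from the numbers above, not a definition given in this section.

```python
import math

# A few rows copied from the table above:
# model -> (SE(A), SE_x(A), SE_pred(A)), all in percentage points.
rows = {
    "google_gemma_3_27b_it": (0.63, 0.56, 0.29),
    "qwen2-math-72b-instruct": (1.0, 0.52, 0.85),
    "deepseek_r1_distill_qwen_32b": (1.1, 0.61, 0.91),
    "qwen1.5-0.5b-chat": (0.48, 0.14, 0.46),
}

for name, (se, se_x, se_pred) in rows.items():
    # Quadrature sum of the two components; if the decomposition holds,
    # this should reproduce SE(A) up to the table's rounding.
    recombined = math.hypot(se_x, se_pred)
    print(f"{name:30s} SE(A)={se:<5} sqrt(SE_x^2 + SE_pred^2)={recombined:.2f}")
```

For these four rows the recombined values come out to 0.63, 1.00, 1.10, and 0.48, matching the SE(A) column.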