gsm8k_cot: by models



Figure: SE predicted by accuracy. Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.
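The figure itself is not reproduced here, but the predicted curve can be sketched. A minimal sketch, assuming the prediction is the standard binomial formula over GSM8K's 1319 test questions; the SE(A) column in the table below tracks this formula to within about 0.01 pp:

```python
import math

def binomial_se(acc_pct: float, n: int = 1319) -> float:
    """Standard error of an accuracy estimate, in percentage points,
    under a binomial model with n independent questions.
    n = 1319 is the GSM8K test set size."""
    p = acc_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

print(f"{binomial_se(94.2):.2f}")  # 0.64 (table: 0.65)
print(f"{binomial_se(6.86):.2f}")  # 0.70 (table: 0.70)
```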

Figure: CDF of question-level accuracy.
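This plot is likewise missing from the extraction. A minimal sketch of the quantity it presumably shows, assuming question-level accuracy means the fraction of models that answer a given question correctly; the 0/1 matrix below is placeholder data, not the site's results:

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder (models x questions) matrix of 0/1 outcomes; the real
# analysis would use each model's per-question correctness on GSM8K.
correct = rng.integers(0, 2, size=(52, 1319))

q_acc = correct.mean(axis=0)               # accuracy of each question
xs = np.sort(q_acc)
cdf = np.arange(1, xs.size + 1) / xs.size  # empirical CDF at each xs
# Plotting xs against cdf (e.g. matplotlib's plt.step) gives the curve.
```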

Results table by model

pass1 is the pass@1 accuracy in percent; the SE columns are standard errors in percentage points.

| model | pass1 | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|:--|--:|--:|--:|--:|--:|--:|
| google_gemma_3_27b_it | 94.2 | 28 | 11 | 0.65 | 0.59 | 0.27 |
| qwen3-32b | 93.9 | 28 | 11 | 0.66 | 0.54 | 0.38 |
| llama-3.1-70B-instruct | 93.8 | 28 | 11 | 0.66 | 0.66 | 0 |
| qwen3-14b | 93.5 | 27.7 | 11 | 0.68 | 0.6 | 0.32 |
| deepseek_r1_distill_llama_70b | 92.6 | 27.6 | 10 | 0.72 | 0.53 | 0.48 |
| qwen2-math-72b-instruct | 92.5 | 27.3 | 9 | 0.73 | 0.49 | 0.53 |
| qwen3-8b | 92.2 | 26.8 | 11 | 0.74 | 0.64 | 0.38 |
| qwen2.5-coder-32b-instruct | 92.1 | 27 | 9 | 0.74 | 0.56 | 0.48 |
| google_gemma_3_12b_it | 92.1 | 26.9 | 12 | 0.74 | 0.65 | 0.36 |
| qwen2-72b-instruct | 91.6 | 26.8 | 10 | 0.76 | 0.55 | 0.53 |
| google_gemma_2_27b_it | 90.2 | 25.6 | 9 | 0.82 | 0.69 | 0.45 |
| qwen3-4b | 89.2 | 25.3 | 11 | 0.85 | 0.78 | 0.35 |
| google_gemma_2_9b_it | 87.6 | 24.3 | 11 | 0.91 | 0.74 | 0.52 |
| qwen2.5-coder-14b-instruct | 86.4 | 24 | 11 | 0.94 | 0.63 | 0.7 |
| deepseek_r1_distill_qwen_7b | 86.4 | 24.2 | 11 | 0.94 | 0.65 | 0.68 |
| deepseek_r1_distill_qwen_32b | 85.9 | 25.1 | 10 | 0.96 | 0.57 | 0.77 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 84.8 | 23.3 | 11 | 0.99 | 0.67 | 0.73 |
| deepseek_r1_distill_qwen_14b | 84.7 | 24.1 | 11 | 0.99 | 0.7 | 0.7 |
| qwen1.5-72b-chat | 84.2 | 23 | 10 | 1 | 0.73 | 0.69 |
| google_gemma_3_4b_it | 83.4 | 22.5 | 13 | 1 | 0.86 | 0.56 |
| qwen1.5-32b-chat | 83.2 | 22.6 | 11 | 1 | 0.71 | 0.74 |
| qwen2-math-7b-instruct | 83 | 22.2 | 11 | 1 | 0.76 | 0.7 |
| llama-3.1-8B-instruct | 81 | 21.3 | 15 | 1.1 | 1.1 | 0.035 |
| qwen2-math-1.5b-instruct | 80.4 | 21.1 | 11 | 1.1 | 0.81 | 0.73 |
| mistralai_ministral_8b_instruct_2410 | 79.8 | 20.8 | 12 | 1.1 | 0.78 | 0.79 |
| deepseek_r1_distill_llama_8b | 79.7 | 21.4 | 12 | 1.1 | 0.72 | 0.84 |
| qwen2-7b-instruct | 78 | 20.4 | 12 | 1.1 | 0.74 | 0.87 |
| qwen2.5-coder-7b-instruct | 77.1 | 20 | 12 | 1.2 | 0.77 | 0.87 |
| mistralai_mathstral_7b_v0.1 | 74.9 | 18.9 | 11 | 1.2 | 0.78 | 0.9 |
| llama-3.2-3B-instruct | 73.4 | 18 | 17 | 1.2 | 1.2 | 0 |
| qwen3-1.7b | 73.4 | 18.1 | 11 | 1.2 | 1 | 0.64 |
| qwen1.5-14b-chat | 72.8 | 17.9 | 10 | 1.2 | 0.9 | 0.83 |
| deepseek_r1_distill_qwen_1.5b | 68.6 | 17.3 | 12 | 1.3 | 0.79 | 1 |
| deepseek_v2_lite_chat | 67.7 | 16 | 11 | 1.3 | 0.92 | 0.9 |
| qwen2.5-coder-3b-instruct | 66.2 | 15.5 | 11 | 1.3 | 0.89 | 0.95 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 65.7 | 15.7 | 11 | 1.3 | 0.87 | 0.98 |
| qwen1.5-7b-chat | 59.1 | 13.2 | 12 | 1.4 | 0.93 | 0.99 |
| google_codegemma_1.1_7b_it | 54.8 | 11.8 | 13 | 1.4 | 0.97 | 0.97 |
| qwen2.5-coder-1.5b-instruct | 49.9 | 10.3 | 11 | 1.4 | 0.93 | 1 |
| mistralai_mistral_7b_instruct_v0.3 | 48.3 | 10 | 11 | 1.4 | 0.93 | 1 |
| google_gemma_3_1b_it | 44.7 | 8.82 | 12 | 1.4 | 1.1 | 0.79 |
| qwen3-0.6b | 41.6 | 7.92 | 13 | 1.4 | 1 | 0.91 |
| mistralai_mistral_7b_instruct_v0.2 | 40 | 7.87 | 11 | 1.3 | 0.94 | 0.97 |
| llama-3.2-1B-instruct | 38 | 7.06 | 12 | 1.3 | 1.3 | 0 |
| qwen2-1.5b-instruct | 37.9 | 7.42 | 11 | 1.3 | 0.78 | 1.1 |
| mistralai_mistral_7b_instruct_v0.1 | 34.3 | 6.37 | 11 | 1.3 | 0.84 | 1 |
| google_gemma_7b_it | 28.9 | 5.11 | 12 | 1.2 | 0.92 | 0.85 |
| qwen2-0.5b-instruct | 19.5 | 3.12 | 13 | 1.1 | 0.62 | 0.9 |
| qwen1.5-1.8b-chat | 15.8 | 2.53 | 12 | 1 | 0.5 | 0.87 |
| qwen2.5-coder-0.5b-instruct | 14 | 2.04 | 13 | 0.95 | 0.55 | 0.78 |
| google_gemma_2b_it | 9.9 | 1.46 | 12 | 0.82 | 0.58 | 0.58 |
| qwen1.5-0.5b-chat | 6.86 | 1.02 | 13 | 0.7 | 0.34 | 0.61 |
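One regularity in the SE columns is worth making explicit: they are consistent with the variance decomposition SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2, with SE(A) splitting into a question-level component and a component predicted from accuracy. This reading is inferred from the numbers rather than stated on the page; a quick check on two rows:

```python
import math

# (SE(A), SE_x(A), SE_pred(A)) for two rows of the table above
rows = {
    "qwen3-32b": (0.66, 0.54, 0.38),
    "deepseek_r1_distill_llama_70b": (0.72, 0.53, 0.48),
}
for name, (se, se_x, se_pred) in rows.items():
    recon = math.sqrt(se_x**2 + se_pred**2)
    print(f"{name}: SE(A)={se:.2f}, sqrt(SE_x^2 + SE_pred^2)={recon:.2f}")
# Both reconstructions match SE(A) to two decimal places.
```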