mgsm_cot: by models



Figure: SE predicted by accuracy

The typical standard error between pairs of models on this dataset, plotted as a function of absolute accuracy.
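For intuition, the standard error of an accuracy estimate is largest near 50% and shrinks toward the extremes. Below is a minimal sketch of the binomial prediction, assuming N = 2750 independent items (250 MGSM problems × 11 languages; the page does not state the item count, but this N reproduces the SE(A) column in the table below):

```python
import math

def predicted_se_pp(acc: float, n_items: int = 2750) -> float:
    """Binomial SE of accuracy in percentage points: 100 * sqrt(a(1-a)/N).

    n_items = 2750 is an assumption (250 MGSM problems x 11 languages);
    it reproduces the SE(A) column in the results table on this page.
    """
    return 100.0 * math.sqrt(acc * (1.0 - acc) / n_items)

print(round(predicted_se_pp(0.894), 2))  # 0.59, matches qwen3-32b
print(round(predicted_se_pp(0.511), 2))  # 0.95, matches llama-3.2-3B-instruct
```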

Figure: CDF of question-level accuracy
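Such a curve is the empirical CDF of per-question pass rates. A minimal sketch of the computation; the per-question data is not reproduced on this page, so the input array here is a hypothetical placeholder:

```python
import numpy as np

# q_acc[i] = pass rate of question i across the ~10-18 samples per question.
# Placeholder values; the real per-question pass rates are not shown here.
q_acc = np.random.default_rng(0).uniform(0.0, 1.0, size=2750)

xs = np.sort(q_acc)
cdf = np.arange(1, xs.size + 1) / xs.size  # P(question-level accuracy <= x)
# Plotting xs vs. cdf as a step function reproduces the figure above.
```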

Results table by model

Column meanings (the page does not define them, so glosses beyond the verifiable ones are presumptions): pass@1 is the mean single-sample accuracy in %, and SE(A) is its standard error in percentage points; win_rate is presumably a pairwise win rate (in %) against the other models; count is evidently the average number of samples drawn per question. Where reported, SE_x(A) and SE_pred(A) decompose the standard error into a question-sampling component and a per-question prediction-sampling component, with SE(A)² ≈ SE_x(A)² + SE_pred(A)². NaN marks models for which the decomposition is unavailable.

| model | pass@1 | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|:---|---:|---:|---:|---:|---:|---:|
| qwen3-32b | 89.4 | 40.6 | 11 | 0.59 | NaN | NaN |
| google_gemma_3_27b_it | 89.2 | 40.5 | 10 | 0.59 | 0.52 | 0.28 |
| deepseek_r1_distill_llama_70b | 86.7 | 39 | 11 | 0.65 | NaN | NaN |
| google_gemma_3_12b_it | 86.3 | 38.3 | 12 | 0.66 | NaN | NaN |
| llama-3.1-70B-instruct | 86.1 | 38.6 | 12 | 0.66 | 0.66 | 0 |
| qwen3-14b | 84.4 | 36.7 | 11 | 0.69 | 0.58 | 0.37 |
| google_gemma_2_27b_it | 83.9 | 36.8 | 10 | 0.7 | 0.58 | 0.39 |
| qwen2-72b-instruct | 80.6 | 34 | 11 | 0.75 | NaN | NaN |
| qwen3-8b | 80.5 | 33.9 | 11 | 0.76 | 0.64 | 0.41 |
| deepseek_r1_distill_qwen_32b | 80.3 | 33.9 | 11 | 0.76 | NaN | NaN |
| qwen2.5-coder-32b-instruct | 79.4 | 33.2 | 11 | 0.77 | NaN | NaN |
| qwen2-math-72b-instruct | 79 | 33.2 | 11 | 0.78 | NaN | NaN |
| google_gemma_2_9b_it | 79 | 33.4 | 11 | 0.78 | 0.64 | 0.44 |
| deepseek_r1_distill_qwen_14b | 74.6 | 30.1 | 13 | 0.83 | NaN | NaN |
| qwen3-4b | 74.4 | 29.8 | 12 | 0.83 | 0.72 | 0.42 |
| google_gemma_3_4b_it | 71.8 | 29.1 | 13 | 0.86 | 0.68 | 0.53 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 71.7 | 28.5 | 10 | 0.86 | NaN | NaN |
| qwen2.5-coder-14b-instruct | 68.8 | 26.7 | 11 | 0.88 | 0.69 | 0.56 |
| qwen1.5-72b-chat | 65.3 | 24.6 | 11 | 0.91 | NaN | NaN |
| mistralai_ministral_8b_instruct_2410 | 64.6 | 24.2 | 12 | 0.91 | NaN | NaN |
| llama-3.1-8B-instruct | 64.2 | 25.1 | 15 | 0.91 | 0.91 | 0 |
| deepseek_r1_distill_qwen_7b | 61.7 | 22.6 | 13 | 0.93 | NaN | NaN |
| mistralai_mathstral_7b_v0.1 | 60.1 | 22.6 | 13 | 0.93 | NaN | NaN |
| qwen1.5-32b-chat | 59.2 | 21.7 | 10 | 0.94 | NaN | NaN |
| qwen2.5-coder-7b-instruct | 57.3 | 20.1 | 13 | 0.94 | NaN | NaN |
| qwen2-7b-instruct | 57 | 20.3 | 13 | 0.94 | NaN | NaN |
| qwen3-1.7b | 54.6 | 18.9 | 12 | 0.95 | 0.78 | 0.54 |
| llama-3.2-3B-instruct | 51.1 | 18.3 | 18 | 0.95 | 0.95 | 0 |
| qwen2-math-7b-instruct | 50.8 | 17 | 12 | 0.95 | 0.74 | 0.6 |
| qwen1.5-14b-chat | 46.8 | 15.4 | 11 | 0.95 | 0.69 | 0.65 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 45.3 | 15.4 | 10 | 0.95 | 0.67 | 0.67 |
| deepseek_r1_distill_llama_8b | 44.8 | 14.6 | 12 | 0.95 | 0.7 | 0.64 |
| google_codegemma_1.1_7b_it | 41.5 | 13.9 | 13 | 0.94 | 0.68 | 0.65 |
| deepseek_v2_lite_chat | 40.2 | 13 | 9.9 | 0.94 | NaN | NaN |
| qwen2.5-coder-3b-instruct | 40.2 | 12.5 | 12 | 0.93 | 0.66 | 0.67 |
| deepseek_r1_distill_qwen_1.5b | 33.4 | 9.7 | 12 | 0.9 | 0.68 | 0.59 |
| qwen1.5-7b-chat | 31.9 | 9.27 | 11 | 0.89 | 0.62 | 0.63 |
| qwen2-math-1.5b-instruct | 31.2 | 8.91 | 12 | 0.88 | 0.71 | 0.53 |
| mistralai_mistral_7b_instruct_v0.3 | 29.1 | 8.63 | 13 | 0.87 | NaN | NaN |
| google_gemma_3_1b_it | 26.4 | 7.53 | 12 | 0.84 | 0.62 | 0.56 |
| mistralai_mistral_7b_instruct_v0.2 | 24.6 | 7.34 | 13 | 0.82 | NaN | NaN |
| qwen2.5-coder-1.5b-instruct | 23.1 | 6.16 | 12 | 0.8 | 0.53 | 0.6 |
| qwen3-0.6b | 22.8 | 6.32 | 13 | 0.8 | 0.59 | 0.54 |
| llama-3.2-1B-instruct | 19.5 | 5.83 | 8 | 0.76 | 0.76 | 0 |
| google_gemma_7b_it | 18.1 | 5.31 | 12 | 0.73 | 0.54 | 0.5 |
| mistralai_mistral_7b_instruct_v0.1 | 17.7 | 4.72 | 13 | 0.73 | NaN | NaN |
| qwen2-1.5b-instruct | 14.6 | 3.68 | 12 | 0.67 | 0.42 | 0.53 |
| google_gemma_2b_it | 5.5 | 1.79 | 12 | 0.43 | 0.25 | 0.35 |
| qwen1.5-1.8b-chat | 5.38 | 1.38 | 11 | 0.43 | 0.2 | 0.38 |
| qwen2-0.5b-instruct | 5.05 | 1.33 | 13 | 0.42 | 0.2 | 0.37 |
| qwen2.5-coder-0.5b-instruct | 4.89 | 1.31 | 13 | 0.41 | 0.21 | 0.35 |
| qwen1.5-0.5b-chat | 2.21 | 0.69 | 13 | 0.28 | 0.093 | 0.26 |
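Where both components are reported, they recover the total, SE(A)² ≈ SE_x(A)² + SE_pred(A)². A minimal check against a few rows copied from the table above:

```python
import math

# (model, SE(A), SE_x(A), SE_pred(A)) rows copied from the table above.
rows = [
    ("google_gemma_3_27b_it", 0.59, 0.52, 0.28),
    ("qwen3-14b",             0.69, 0.58, 0.37),
    ("llama-3.2-3B-instruct", 0.95, 0.95, 0.00),
    ("qwen1.5-0.5b-chat",     0.28, 0.093, 0.26),
]

for model, se, se_x, se_pred in rows:
    recovered = math.hypot(se_x, se_pred)  # sqrt(SE_x^2 + SE_pred^2)
    print(f"{model:24s} SE(A)={se:.2f} recovered={recovered:.2f}")
    assert abs(recovered - se) < 0.02, model  # matches to rounding error
```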