Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy. Note that SE_x(A) and SE_pred(A) add in quadrature to SE(A); a sketch reproducing these columns follows the table.
| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| google_gemma_3_27b_it | 89.2 | 43.5 | 3 | 0.59 | 0.5 | 0.32 |
| qwen3-32b | 89 | 43.3 | 2.3 | 0.6 | NaN | NaN |
| google_gemma_3_12b_it | 86.2 | 41.2 | 3.4 | 0.66 | NaN | NaN |
| deepseek_r1_distill_llama_70b | 85.8 | 41.2 | 2.2 | 0.67 | NaN | NaN |
| qwen3-14b | 84.4 | 39.6 | 3 | 0.69 | 0.58 | 0.38 |
| google_gemma_2_27b_it | 82.5 | 38.8 | 1 | 0.73 | NaN | NaN |
| qwen3-8b | 80.1 | 36.5 | 3 | 0.76 | 0.62 | 0.44 |
| qwen2-72b-instruct | 79.4 | 35.9 | 2.2 | 0.77 | NaN | NaN |
| deepseek_r1_distill_qwen_32b | 79.4 | 36.1 | 2.2 | 0.77 | NaN | NaN |
| google_gemma_2_9b_it | 77.8 | 35.5 | 3 | 0.79 | 0.62 | 0.49 |
| qwen2.5-coder-32b-instruct | 77.3 | 34.6 | 2.2 | 0.8 | NaN | NaN |
| llama-3.1-70B-instruct | 77 | 35.8 | 4 | 0.8 | 0.8 | 0 |
| qwen3-4b | 73.7 | 32.1 | 4 | 0.84 | 0.71 | 0.44 |
| qwen2-math-72b-instruct | 73.7 | 32.7 | 2.2 | 0.84 | NaN | NaN |
| deepseek_r1_distill_qwen_14b | 73.7 | 32.3 | 3.4 | 0.84 | NaN | NaN |
| google_gemma_3_4b_it | 71.2 | 31.3 | 4 | 0.86 | 0.66 | 0.55 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 68 | 28.9 | 2.3 | 0.89 | NaN | NaN |
| qwen2.5-coder-14b-instruct | 64.9 | 26.9 | 3 | 0.91 | 0.67 | 0.61 |
| qwen1.5-72b-chat | 63.7 | 26 | 2.2 | 0.92 | NaN | NaN |
| qwen1.5-32b-chat | 58.6 | 23.5 | 2.3 | 0.94 | NaN | NaN |
| deepseek_r1_distill_qwen_7b | 58 | 22.9 | 3.4 | 0.94 | NaN | NaN |
| qwen3-1.7b | 54.9 | 21.1 | 4 | 0.95 | 0.77 | 0.56 |
| qwen2-7b-instruct | 52.6 | 19.9 | 3.4 | 0.95 | NaN | NaN |
| mistralai_ministral_8b_instruct_2410 | 51 | 19.2 | 3.4 | 0.95 | NaN | NaN |
| mistralai_mathstral_7b_v0.1 | 50.9 | 20.1 | 3.4 | 0.95 | NaN | NaN |
| qwen2.5-coder-7b-instruct | 50.8 | 18.9 | 3.4 | 0.95 | NaN | NaN |
| mistralai_mixtral_8x7b_instruct_v0.1 | 44.4 | 16.7 | 3 | 0.95 | 0.64 | 0.7 |
| llama-3.1-8B-instruct | 44 | 16.8 | 7 | 0.95 | 0.95 | 0 |
| qwen1.5-14b-chat | 43.5 | 15.6 | 3 | 0.95 | 0.67 | 0.66 |
| qwen2-math-7b-instruct | 43.3 | 15.4 | 4 | 0.94 | 0.68 | 0.66 |
| deepseek_r1_distill_llama_8b | 40.4 | 14.2 | 4 | 0.94 | 0.66 | 0.67 |
| google_codegemma_1.1_7b_it | 38.4 | 14 | 4 | 0.93 | 0.63 | 0.68 |
| deepseek_v2_lite_chat | 37.2 | 13 | 2.3 | 0.92 | NaN | NaN |
| llama-3.2-3B-instruct | 35.2 | 12.6 | 10 | 0.91 | 0.91 | 0 |
| qwen2.5-coder-3b-instruct | 34.3 | 11.5 | 4 | 0.91 | 0.61 | 0.67 |
| qwen1.5-7b-chat | 29.4 | 9.3 | 3 | 0.87 | 0.59 | 0.63 |
| deepseek_r1_distill_qwen_1.5b | 27.2 | 8.45 | 4 | 0.85 | 0.61 | 0.59 |
| google_gemma_3_1b_it | 26 | 8.38 | 4 | 0.84 | 0.6 | 0.59 |
| mistralai_mistral_7b_instruct_v0.3 | 24.5 | 7.91 | 3.4 | 0.82 | NaN | NaN |
| mistralai_mistral_7b_instruct_v0.2 | 23.9 | 7.99 | 3.4 | 0.81 | NaN | NaN |
| qwen3-0.6b | 21.9 | 6.84 | 4 | 0.79 | 0.55 | 0.57 |
| qwen2-math-1.5b-instruct | 20 | 5.99 | 4 | 0.76 | 0.54 | 0.54 |
| google_gemma_7b_it | 17.8 | 5.79 | 4 | 0.73 | 0.51 | 0.52 |
| qwen2.5-coder-1.5b-instruct | 17 | 4.89 | 4 | 0.72 | 0.43 | 0.57 |
| mistralai_mistral_7b_instruct_v0.1 | 12.1 | 3.48 | 3.4 | 0.62 | NaN | NaN |
| qwen2-1.5b-instruct | 8.15 | 2.3 | 4 | 0.52 | 0.26 | 0.46 |
| google_gemma_2b_it | 5.05 | 1.89 | 4 | 0.42 | 0.21 | 0.36 |
| qwen2.5-coder-0.5b-instruct | 3.46 | 1.08 | 4 | 0.35 | 0.14 | 0.32 |
| qwen1.5-1.8b-chat | 3.37 | 1.09 | 3 | 0.34 | 0.13 | 0.32 |
| qwen2-0.5b-instruct | 2.97 | 0.949 | 4 | 0.32 | 0.12 | 0.3 |
| qwen1.5-0.5b-chat | 1.42 | 0.6 | 4 | 0.23 | 0.055 | 0.22 |
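
The SE(A) column is consistent with the plain binomial standard error of an accuracy estimated over a fixed question set. The sketch below is an inference from the table's own numbers, not the authors' code: it inverts the binomial formula on one row to recover the implied question count (`n` is not stated in the source), checks that the same `n` reproduces SE(A) for other rows, and verifies that SE_x(A) and SE_pred(A) add in quadrature to SE(A).

```python
import math

# If SE(A) is the binomial standard error of an accuracy p over n
# independent questions, then, in percentage points:
#   SE(A) = 100 * sqrt(p * (1 - p) / n)

def binomial_se(pass1_pct: float, n: int) -> float:
    """Binomial standard error of an accuracy, in percentage points."""
    p = pass1_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# Invert the formula on one row (qwen3-14b: pass@1 = 84.4, SE(A) = 0.69).
# n is an inferred quantity, not given in the source.
p, se = 0.844, 0.69
n = round(p * (1 - p) / (se / 100.0) ** 2)   # ~2765 questions

# The same n then reproduces SE(A) for other rows of the table:
print(round(binomial_se(89.2, n), 2))  # 0.59 (google_gemma_3_27b_it)
print(round(binomial_se(12.1, n), 2))  # 0.62 (mistral_7b_instruct_v0.1)

# The two reported components add in quadrature to the total:
#   SE(A)^2 = SE_x(A)^2 + SE_pred(A)^2
se_x, se_pred = 0.58, 0.38             # qwen3-14b row
print(round(math.hypot(se_x, se_pred), 2))  # 0.69, matching SE(A)
```

On this reading, SE_x(A) would be the between-question component and SE_pred(A) the within-question resampling component (note SE_pred(A) = 0 for the llama rows, consistent with deterministic decoding), though the source does not spell this out.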