math500_cot: results by model



[Figure: SE predicted by accuracy — the typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.]
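One way to read that figure: if two models' per-question errors were independent binomial draws over the n = 500 MATH-500 questions, the standard error of their accuracy difference would be the root-sum-square of the two binomial SEs. This is a minimal sketch under that simplifying independence assumption (in practice errors on shared questions are correlated, which typically shrinks the paired SE); the function name is mine, not from the source.

```python
import math

def pairwise_se(acc_a, acc_b, n=500):
    """SE of the accuracy difference between two models, assuming
    independent binomial errors over n questions. This is an
    approximation: correlated errors on shared questions would
    reduce it."""
    var_a = acc_a * (1 - acc_a) / n
    var_b = acc_b * (1 - acc_b) / n
    return math.sqrt(var_a + var_b)

# Two models both near 50% accuracy on 500 questions:
print(f"{pairwise_se(0.5, 0.5) * 100:.2f} percentage points")
```

As the figure's x-axis suggests, this quantity peaks for mid-range accuracies and shrinks toward 0% and 100%, where the binomial variance A(1-A) vanishes.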

[Figure: CDF of question-level accuracy.]

Results table by model

(pass@1, win_rate, and the SE columns are in percentage points)

model pass@1 win_rate count SE(A) SE_x(A) SE_pred(A)
google_gemma_3_27b_it 85.5 44.9 3 1.6 1.2 0.99
qwen3-32b 81.5 41.6 2 1.7 1.2 1.3
deepseek_r1_distill_qwen_32b 81.1 41.8 2 1.8 1.2 1.3
qwen3-14b 80.7 41 3 1.8 1.4 1.1
deepseek_r1_distill_llama_70b 78.9 40 2 1.8 1.3 1.3
google_gemma_3_12b_it 78.3 39.4 3 1.8 1.5 1.1
qwen3-8b 78.3 39.2 3 1.8 1.5 1.1
qwen3-4b 77.1 38.3 4 1.9 1.5 1.2
deepseek_r1_distill_qwen_14b 76.3 38.5 3 1.9 1.4 1.3
qwen2.5-coder-32b-instruct 74.4 36.3 2 2 1.5 1.2
deepseek_r1_distill_qwen_7b 72.7 36 3 2 1.4 1.4
qwen2-math-7b-instruct 70.7 33.6 2 2 1.6 1.3
qwen2.5-coder-14b-instruct 67.2 31.2 3 2.1 1.6 1.4
deepseek_r1_distill_llama_8b 67.1 32.7 4 2.1 1.5 1.5
qwen3-1.7b 65.4 30.4 4 2.1 1.7 1.3
google_gemma_3_4b_it 65 31 4 2.1 1.7 1.3
deepseek_r1_distill_qwen_1.5b 64.6 30.3 4 2.1 1.5 1.5
llama-3.1-70B-instruct 63.4 29.4 4 2.2 2.1 0.2
qwen2-72b-instruct 63 28.7 2 2.2 1.7 1.4
qwen2-math-1.5b-instruct 61.5 27.7 2 2.2 1.6 1.4
qwen2.5-coder-7b-instruct 53.7 23.4 3 2.2 1.6 1.5
google_gemma_2_27b_it 52.8 22.1 1 2.2 NaN NaN
qwen2-7b-instruct 45.7 18.5 3 2.2 1.6 1.6
llama-3.1-8B-instruct 45.1 18.7 6 2.2 2.2 0.25
google_gemma_2_9b_it 44.9 17.8 3 2.2 1.9 1.2
mistralai_mixtral_8x22b_instruct_v0.1 43.8 17.9 2 2.2 1.5 1.7
qwen2.5-coder-3b-instruct 40.8 16.9 4 2.2 1.4 1.7
mistralai_mathstral_7b_v0.1 39.8 16.1 3 2.2 1.4 1.7
mistralai_ministral_8b_instruct_2410 39.8 15.8 3 2.2 1.5 1.6
llama-3.2-3B-instruct 38.3 15.1 10 2.2 2.2 0.27
qwen1.5-72b-chat 37.9 14.7 2 2.2 1.6 1.4
qwen1.5-32b-chat 37.1 14.6 2 2.2 1.5 1.6
qwen3-0.6b 31.6 12.3 3 2.1 1.4 1.6
qwen1.5-14b-chat 28.3 10.1 3 2 1.5 1.4
qwen2.5-coder-1.5b-instruct 26.2 9.64 4 2 1.2 1.5
mistralai_mixtral_8x7b_instruct_v0.1 23.3 8.16 3 1.9 1.2 1.4
llama-3.2-1B-instruct 20.5 7.33 11 1.8 1.8 0.16
deepseek_v2_lite_chat 18.9 6.46 2 1.8 1.2 1.3
google_codegemma_1.1_7b_it 18.6 6.2 4 1.7 1.2 1.3
qwen1.5-7b-chat 15.9 5.26 3 1.6 1.1 1.2
google_gemma_3_1b_it 13.6 5.92 4 1.5 0.91 1.2
mistralai_mistral_7b_instruct_v0.3 10.5 3.39 3 1.4 0.81 1.1
mistralai_mistral_7b_instruct_v0.2 9 3.06 3 1.3 0.71 1.1
qwen2.5-coder-0.5b-instruct 6 2.38 4 1.1 0.44 0.97
google_gemma_7b_it 5.75 1.86 4 1 0.64 0.82
qwen2-1.5b-instruct 4.65 1.54 4 0.94 0.42 0.84
mistralai_mistral_7b_instruct_v0.1 4.13 1.42 3 0.89 0.39 0.8
qwen2-0.5b-instruct 2.05 0.756 4 0.63 0.21 0.6
qwen1.5-1.8b-chat 0.867 0.293 3 0.41 0.16 0.38
google_gemma_2b_it 0.35 0.104 4 0.26 0 0.26
qwen1.5-0.5b-chat 0.3 0.145 4 0.24 0 0.24
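The SE(A) column above is consistent with the plain binomial standard error sqrt(A(1-A)/n) over the n = 500 MATH-500 questions — e.g. A = 85.5% gives about 1.6 points, A = 45.1% about 2.2, A = 0.35% about 0.26 (how SE_x(A) and SE_pred(A) decompose or predict this is not spelled out here, so they are not reproduced). A minimal check of that reading:

```python
import math

def binomial_se(acc, n=500):
    """Binomial standard error of an accuracy estimate over n
    independent questions: sqrt(A * (1 - A) / n)."""
    return math.sqrt(acc * (1 - acc) / n)

# Percentage-point SE for a few pass@1 values from the table:
for acc_pct in (85.5, 45.1, 0.35):
    se_pct = binomial_se(acc_pct / 100) * 100
    print(f"pass@1 {acc_pct:5.2f}%  ->  SE ~ {se_pct:.2f} points")
```

The agreement with the table's SE(A) values (to the reported precision) suggests that column treats each of the 500 questions as an independent Bernoulli trial.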