math500_cot: by model



[Figure: SE predicted by accuracy — typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.]
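For a dataset of N questions, the standard error of a single model's accuracy is the usual binomial sqrt(A(1-A)/N), which at A = 50% on 500 questions gives about 2.2 points — matching the SE(A) column in the table below. A minimal sketch of the curve in the figure, assuming MATH-500's 500 questions and, for the pairwise case, independent errors between the two models (the plotted values may instead account for question-level correlation):

```python
import math

N = 500  # assumption: MATH-500 has 500 questions

def se_accuracy(acc: float, n: int = N) -> float:
    """Binomial standard error of a single model's accuracy."""
    return math.sqrt(acc * (1 - acc) / n)

def se_pairwise(acc: float, n: int = N) -> float:
    """Approximate SE of the accuracy gap between two models at similar
    accuracy, assuming independent errors (ignores question-level correlation)."""
    return math.sqrt(2) * se_accuracy(acc, n)

for acc in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"A={acc:.0%}  SE(A)={100*se_accuracy(acc):.2f}pp  pairwise≈{100*se_pairwise(acc):.2f}pp")
```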

[Figure: CDF of question-level accuracy.]
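The plotted CDF can be recovered from the raw grades by averaging over samples within each question and sorting. A minimal sketch, assuming grades are stored as a question × sample boolean matrix (the array name and shapes here are illustrative assumptions):

```python
import numpy as np

def question_level_cdf(correct: np.ndarray):
    """correct: bool array (n_questions, n_samples); True means the sampled
    completion was graded correct. Returns the sorted per-question
    accuracies and the matching empirical CDF values."""
    per_question = correct.mean(axis=1)        # accuracy of each question
    xs = np.sort(per_question)
    cdf = np.arange(1, xs.size + 1) / xs.size  # empirical CDF
    return xs, cdf

# Hypothetical usage with random stand-in grades:
rng = np.random.default_rng(0)
fake_grades = rng.random((500, 4)) < 0.7
xs, cdf = question_level_cdf(fake_grades)
```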

Results table by model (accuracy, win-rate, and SE columns are in percentage points; a computation sketch for these metrics follows the table):

| model | pass@1 | pass@count | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---:|---:|---:|---:|---:|---:|---:|
| google_gemma_3_27b_it | 85.5 | 92.8 | 44.9 | 3 | 1.6 | 1.2 | 0.99 |
| qwen3-32b | 81.5 | 89.4 | 41.6 | 2 | 1.7 | 1.2 | 1.3 |
| deepseek_r1_distill_qwen_32b | 81.1 | 89 | 41.8 | 2 | 1.8 | 1.2 | 1.3 |
| qwen3-14b | 80.7 | 89.4 | 41 | 3 | 1.8 | 1.4 | 1.1 |
| deepseek_r1_distill_llama_70b | 78.9 | 87.2 | 40 | 2 | 1.8 | 1.3 | 1.3 |
| google_gemma_3_12b_it | 78.3 | 87.4 | 39.4 | 3 | 1.8 | 1.5 | 1.1 |
| qwen3-8b | 78.3 | 86.6 | 39.2 | 3 | 1.8 | 1.5 | 1.1 |
| qwen3-4b | 77.1 | 88 | 38.3 | 4 | 1.9 | 1.5 | 1.2 |
| deepseek_r1_distill_qwen_14b | 76.3 | 89 | 38.5 | 3 | 1.9 | 1.4 | 1.3 |
| qwen2.5-coder-32b-instruct | 74.4 | 82.2 | 36.3 | 2 | 2 | 1.5 | 1.2 |
| deepseek_r1_distill_qwen_7b | 72.7 | 87.6 | 36 | 3 | 2 | 1.4 | 1.4 |
| qwen2-math-7b-instruct | 70.7 | 79.2 | 33.6 | 2 | 2 | 1.6 | 1.3 |
| qwen2.5-coder-14b-instruct | 67.2 | 80.4 | 31.2 | 3 | 2.1 | 1.6 | 1.4 |
| deepseek_r1_distill_llama_8b | 67.1 | 84.8 | 32.7 | 4 | 2.1 | 1.5 | 1.5 |
| qwen3-1.7b | 65.4 | 79.6 | 30.4 | 4 | 2.1 | 1.7 | 1.3 |
| google_gemma_3_4b_it | 65 | 79.6 | 31 | 4 | 2.1 | 1.7 | 1.3 |
| deepseek_r1_distill_qwen_1.5b | 64.6 | 82.6 | 30.3 | 4 | 2.1 | 1.5 | 1.5 |
| llama-3.1-70B-instruct | 63.4 | 63.6 | 29.4 | 4 | 2.2 | 2.1 | 0.2 |
| qwen2-72b-instruct | 63 | 72.2 | 28.7 | 2 | 2.2 | 1.7 | 1.4 |
| qwen2-math-1.5b-instruct | 61.5 | 71.8 | 27.7 | 2 | 2.2 | 1.6 | 1.4 |
| qwen2.5-coder-7b-instruct | 53.7 | 71.2 | 23.4 | 3 | 2.2 | 1.6 | 1.5 |
| google_gemma_2_27b_it | 52.8 | 52.8 | 22.1 | 1 | 2.2 | NaN | NaN |
| qwen2-7b-instruct | 45.7 | 64.8 | 18.5 | 3 | 2.2 | 1.6 | 1.6 |
| llama-3.1-8B-instruct | 45.1 | 45.4 | 18.7 | 6 | 2.2 | 2.2 | 0.25 |
| google_gemma_2_9b_it | 44.9 | 56.4 | 17.8 | 3 | 2.2 | 1.9 | 1.2 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 43.8 | 57.8 | 17.9 | 2 | 2.2 | 1.5 | 1.7 |
| qwen2.5-coder-3b-instruct | 40.8 | 65.2 | 16.9 | 4 | 2.2 | 1.4 | 1.7 |
| mistralai_mathstral_7b_v0.1 | 39.8 | 60.4 | 16.1 | 3 | 2.2 | 1.4 | 1.7 |
| mistralai_ministral_8b_instruct_2410 | 39.8 | 58.2 | 15.8 | 3 | 2.2 | 1.5 | 1.6 |
| llama-3.2-3B-instruct | 38.3 | 38.8 | 15.1 | 10 | 2.2 | 2.2 | 0.27 |
| qwen1.5-72b-chat | 37.9 | 48.4 | 14.7 | 2 | 2.2 | 1.6 | 1.4 |
| qwen1.5-32b-chat | 37.1 | 49.8 | 14.6 | 2 | 2.2 | 1.5 | 1.6 |
| qwen3-0.6b | 31.6 | 50.8 | 12.3 | 3 | 2.1 | 1.4 | 1.6 |
| qwen1.5-14b-chat | 28.3 | 44.4 | 10.1 | 3 | 2 | 1.5 | 1.4 |
| qwen2.5-coder-1.5b-instruct | 26.2 | 49.6 | 9.64 | 4 | 2 | 1.2 | 1.5 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 23.3 | 40.4 | 8.16 | 3 | 1.9 | 1.2 | 1.4 |
| llama-3.2-1B-instruct | 20.5 | 20.6 | 7.33 | 11 | 1.8 | 1.8 | 0.16 |
| deepseek_v2_lite_chat | 18.9 | 27.2 | 6.46 | 2 | 1.8 | 1.2 | 1.3 |
| google_codegemma_1.1_7b_it | 18.6 | 34.8 | 6.2 | 4 | 1.7 | 1.2 | 1.3 |
| qwen1.5-7b-chat | 15.9 | 28.6 | 5.26 | 3 | 1.6 | 1.1 | 1.2 |
| google_gemma_3_1b_it | 13.6 | 30.2 | 5.92 | 4 | 1.5 | 0.91 | 1.2 |
| mistralai_mistral_7b_instruct_v0.3 | 10.5 | 20.6 | 3.39 | 3 | 1.4 | 0.81 | 1.1 |
| mistralai_mistral_7b_instruct_v0.2 | 9 | 19 | 3.06 | 3 | 1.3 | 0.71 | 1.1 |
| qwen2.5-coder-0.5b-instruct | 6 | 18 | 2.38 | 4 | 1.1 | 0.44 | 0.97 |
| google_gemma_7b_it | 5.75 | 13.6 | 1.86 | 4 | 1 | 0.64 | 0.82 |
| qwen2-1.5b-instruct | 4.65 | 13.2 | 1.54 | 4 | 0.94 | 0.42 | 0.84 |
| mistralai_mistral_7b_instruct_v0.1 | 4.13 | 9.8 | 1.42 | 3 | 0.89 | 0.39 | 0.8 |
| qwen2-0.5b-instruct | 2.05 | 6.8 | 0.756 | 4 | 0.63 | 0.21 | 0.6 |
| qwen1.5-1.8b-chat | 0.867 | 2.2 | 0.293 | 3 | 0.41 | 0.16 | 0.38 |
| google_gemma_2b_it | 0.35 | 1.4 | 0.104 | 4 | 0.26 | 0 | 0.26 |
| qwen1.5-0.5b-chat | 0.3 | 1.2 | 0.145 | 4 | 0.24 | 0 | 0.24 |
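The column names suggest the following readings: pass@1 is mean accuracy over all sampled completions, pass@count is pass@k with k equal to count (count apparently being the number of completions sampled per question), and SE(A) is the question-clustered standard error of accuracy. Up to rounding, the columns appear to satisfy SE(A)² ≈ SE_x(A)² + SE_pred(A)², consistent with splitting the variance into a question-level and a prediction (sampling) component; win_rate's exact definition isn't given on this page and isn't reproduced below. A minimal sketch under those assumed readings, using the standard unbiased pass@k estimator (Chen et al., 2021):

```python
from math import comb, sqrt

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples drawn from n (with c correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def summarize(correct: np.ndarray):
    """correct: bool array (n_questions, n_samples). Returns pass@1,
    pass@count, and the question-clustered SE of accuracy, in percent."""
    n_q, n_s = correct.shape
    p = correct.mean(axis=1)                 # per-question accuracy
    pass1 = p.mean()                         # mean accuracy over all samples
    pass_count = correct.any(axis=1).mean()  # pass@k with k = count = n_s
    se = p.std(ddof=1) / sqrt(n_q)           # SE(A), clustering by question
    return 100 * pass1, 100 * pass_count, 100 * se
```

Note that with count = 1 the clustered SE reduces to the plain binomial standard error, which is why the single-sample row (google_gemma_2_27b_it) reports SE(A) but NaN for the two components.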