math500_cot: by models

SE predicted by accuracy

Typical standard errors between pairs of models on this dataset, shown as a function of absolute accuracy.

CDF of question-level accuracy

Results table by model

model	pass1	pass@count	win_rate	count	SE(A)	SE_x(A)	SE_pred(A)
google_gemma_3_27b_it	85.3	94.2	42.2	11	1.6	1.3	0.92
deepseek_r1_distill_qwen_32b	83.9	95.2	41.3	10	1.6	1.2	1.1
qwen3-32b	81.9	92.8	39.4	9	1.7	1.3	1.1
deepseek_r1_distill_llama_70b	81.1	94.2	39.4	10	1.7	1.3	1.2
qwen3-14b	81	93	38.7	11	1.8	1.4	1.1
qwen2-math-72b-instruct	80.4	91.4	38.1	7	1.8	1.5	1
qwen3-8b	79.2	93.4	37.3	11	1.8	1.5	1.1
google_gemma_3_12b_it	79	92	37.3	12	1.8	1.5	1
deepseek_r1_distill_qwen_14b	78.9	94.2	37.9	12	1.8	1.3	1.3
qwen3-4b	77.2	92.2	36	12	1.9	1.5	1.1
deepseek_r1_distill_qwen_7b	77.1	93	36.5	12	1.9	1.4	1.3
qwen2.5-coder-32b-instruct	76.1	90.2	35	10	1.9	1.5	1.2
qwen2.5-coder-14b-instruct	71.3	89	31.7	11	2	1.6	1.2
deepseek_r1_distill_llama_8b	70.4	91.2	32.7	12	2	1.5	1.4
qwen2-math-1.5b-instruct	68	68	29.7	1	2.1	NaN	NaN
deepseek_r1_distill_qwen_1.5b	67.6	89.6	30.1	12	2.1	1.5	1.4
qwen3-1.7b	66.7	87	29.1	12	2.1	1.6	1.3
llama-3.1-70B-instruct	65.6	65.8	28.7	12	2.1	2.1	0.21
google_gemma_3_4b_it	65.1	84.8	29.1	12	2.1	1.7	1.3
qwen2-72b-instruct	64.5	83.6	27.2	10	2.1	1.7	1.3
qwen2.5-coder-7b-instruct	61	86.8	25.7	12	2.2	1.6	1.5
google_gemma_2_27b_it	52.3	74.2	20.4	10	2.2	1.9	1.2
qwen2-7b-instruct	51.7	80.2	20.1	12	2.2	1.7	1.5
mistralai_ministral_8b_instruct_2410	49.3	79.2	18.9	12	2.2	1.7	1.5
mistralai_mixtral_8x22b_instruct_v0.1	48.1	77.8	18.4	9	2.2	1.7	1.5
llama-3.1-8B-instruct	47.6	58.6	18	13	2.2	2	1.1
mistralai_mathstral_7b_v0.1	47	76.6	17.7	12	2.2	1.7	1.5
qwen2.5-coder-3b-instruct	46.9	80	18.4	11	2.2	1.5	1.6
google_gemma_2_9b_it	46.4	67.8	16.9	11	2.2	1.9	1.2
llama-3.2-3B-instruct	44.1	55.6	16.6	18	2.2	1.9	1.1
qwen1.5-72b-chat	40.3	73	14.7	10	2.2	1.6	1.5
qwen1.5-32b-chat	38.9	70.4	14.1	9	2.2	1.6	1.5
qwen3-0.6b	33.7	71.4	12.2	13	2.1	1.5	1.5
qwen2.5-coder-1.5b-instruct	32.8	66.4	11.3	12	2.1	1.5	1.5
qwen1.5-14b-chat	30.5	63.6	10.3	11	2.1	1.5	1.4
llama-3.2-1B-instruct	27.9	28.2	9.67	11	2	2	0.23
mistralai_mixtral_8x7b_instruct_v0.1	25.2	61.2	8.44	11	1.9	1.3	1.4
deepseek_v2_lite_chat	22.5	55.8	7.25	9	1.9	1.2	1.4
google_codegemma_1.1_7b_it	20.8	52	6.55	12	1.8	1.3	1.3
qwen1.5-7b-chat	16.5	45.4	5.19	11	1.7	1.1	1.3
google_gemma_3_1b_it	14.5	43.2	5.93	12	1.6	1	1.2
mistralai_mistral_7b_instruct_v0.3	12.9	41	3.86	12	1.5	0.95	1.2
mistralai_mistral_7b_instruct_v0.2	10.1	37.6	2.98	12	1.3	0.82	1.1
qwen2.5-coder-0.5b-instruct	8.03	32.6	2.48	13	1.2	0.7	1
qwen2-1.5b-instruct	6.8	31.6	1.96	12	1.1	0.53	0.99
mistralai_mistral_7b_instruct_v0.1	6.4	26.8	1.99	12	1.1	0.58	0.93
google_gemma_7b_it	5.77	21.4	1.68	12	1	0.68	0.79
qwen2-0.5b-instruct	2.91	19.4	0.985	13	0.75	0.31	0.69
qwen1.5-1.8b-chat	1.42	10.6	0.434	11	0.53	0.16	0.5
qwen1.5-0.5b-chat	0.8	6.6	0.307	12	0.4	0.11	0.38
google_gemma_2b_it	0.117	1.4	0.047	12	0.15	0	0.15
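
As a rough sanity check on the SE(A) column, the sketch below compares a few reported values with the plain binomial standard error sqrt(p(1-p)/N). This is an assumption, not the page's documented method: it takes p to be pass1 as a fraction and N to be 500 questions (the size of MATH500), and the helper name `binomial_se_percent` is hypothetical. It does not attempt to reproduce SE_x(A) or SE_pred(A), whose definitions are not given here.

```python
import math

N_QUESTIONS = 500  # assumption: MATH500 contains 500 questions


def binomial_se_percent(acc_percent: float, n: int = N_QUESTIONS) -> float:
    """Binomial standard error of an accuracy, in percentage points.

    Assumes each question is an independent Bernoulli trial with
    success probability equal to the observed accuracy.
    """
    p = acc_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# (model, pass1, reported SE(A)) copied from rows of the table above
rows = [
    ("google_gemma_3_27b_it", 85.3, 1.6),
    ("qwen2-7b-instruct", 51.7, 2.2),
    ("qwen1.5-7b-chat", 16.5, 1.7),
    ("google_gemma_2b_it", 0.117, 0.15),
]

for name, pass1, reported in rows:
    estimate = binomial_se_percent(pass1)
    print(f"{name}: binomial estimate {estimate:.2f}, reported {reported}")
```

Under these assumptions the estimates agree with the reported SE(A) values to the precision shown in the table, which suggests, but does not confirm, that SE(A) is the per-model binomial standard error over the 500 questions.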