math500_cot: results by model



SE predicted by accuracy

The typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
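As a rough sanity check on the relationship between accuracy and standard error, one can use the simple binomial prediction: treating each of the 500 questions as an independent Bernoulli trial, the standard error of a pass rate is sqrt(p(1-p)/N). This is a minimal sketch (the function name and the independence assumption are ours, not from this page); it reproduces the ballpark of the SE(A) column, which peaks near 50% accuracy.

```python
import math

def predicted_se(accuracy_pct: float, n_questions: int = 500) -> float:
    """Binomial standard error of a pass rate, in percentage points.

    Assumes each question is an independent Bernoulli trial;
    n_questions=500 matches the MATH-500 split reported here.
    """
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

# A model at 86.5% pass@1:
print(round(predicted_se(86.5), 2))  # -> 1.53 (table reports SE(A) = 1.5)

# SE is largest for mid-range accuracy:
print(round(predicted_se(50.0), 2))  # -> 2.24 (table shows ~2.2 there)
```

The match is only approximate, since the table's SE columns also account for per-question variation across repeated samples.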

CDF of question-level accuracy

Results table by model

| model | pass@1 (%) | pass@count (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|---|
| google_gemma_3_27b_it | 86.5 | 95.6 | 43.3 | 8 | 1.5 | 1.2 | 0.9 |
| deepseek_r1_distill_qwen_32b | 84.8 | 98.2 | 42.1 | 1100 | 1.6 | 1.2 | 1.1 |
| qwen3-32b | 82.3 | 98 | 39.9 | 700 | 1.7 | 1.3 | 1.1 |
| qwen3-14b | 82.3 | 97.8 | 39.8 | 1100 | 1.7 | 1.4 | 1 |
| deepseek_r1_distill_llama_70b | 82.1 | 97.8 | 40.1 | 1100 | 1.7 | 1.3 | 1.2 |
| qwen2-math-72b-instruct | 81.1 | 94.6 | 38.7 | 35 | 1.7 | 1.5 | 0.98 |
| google_gemma_3_12b_it | 80.2 | 98.4 | 38.3 | 1100 | 1.8 | 1.5 | 1 |
| qwen3-8b | 79.7 | 98.2 | 37.8 | 1100 | 1.8 | 1.4 | 1.1 |
| qwen3-4b | 78.2 | 97.8 | 36.8 | 1100 | 1.8 | 1.5 | 1.1 |
| deepseek_r1_distill_qwen_7b | 78.1 | 98.2 | 37.2 | 1100 | 1.8 | 1.4 | 1.2 |
| qwen2.5-coder-32b-instruct | 77 | 97.2 | 35.7 | 1100 | 1.9 | 1.5 | 1.1 |
| qwen2.5-coder-14b-instruct | 72.6 | 98 | 32.7 | 1100 | 2 | 1.6 | 1.2 |
| deepseek_r1_distill_llama_8b | 70.5 | 95.2 | 32.8 | 71 | 2 | 1.5 | 1.4 |
| deepseek_r1_distill_qwen_1.5b | 68.7 | 98 | 30.7 | 1100 | 2.1 | 1.5 | 1.4 |
| qwen3-1.7b | 67.9 | 97.6 | 29.9 | 1100 | 2.1 | 1.6 | 1.3 |
| llama-3.1-70B-instruct | 66.4 | 88.6 | 29.1 | 1000 | 2.1 | 1.7 | 1.3 |
| google_gemma_3_4b_it | 65.7 | 97.2 | 29.4 | 1100 | 2.1 | 1.7 | 1.3 |
| qwen2-72b-instruct | 65.3 | 96.4 | 27.9 | 220 | 2.1 | 1.7 | 1.3 |
| qwen2.5-coder-7b-instruct | 62.4 | 96.8 | 26.5 | 1100 | 2.2 | 1.6 | 1.5 |
| google_gemma_2_27b_it | 53.1 | 93.8 | 20.8 | 1100 | 2.2 | 1.9 | 1.2 |
| qwen2-7b-instruct | 52.5 | 96.8 | 20.5 | 1100 | 2.2 | 1.7 | 1.5 |
| mistralai_ministral_8b_instruct_2410 | 49.3 | 97.4 | 19 | 1100 | 2.2 | 1.6 | 1.5 |
| mistralai_mathstral_7b_v0.1 | 48.5 | 97.8 | 18.6 | 1100 | 2.2 | 1.6 | 1.5 |
| llama-3.1-8B-instruct | 48.4 | 93.8 | 18.7 | 1100 | 2.2 | 1.7 | 1.5 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 47.8 | 91.8 | 18.4 | 100 | 2.2 | 1.6 | 1.5 |
| qwen2.5-coder-3b-instruct | 47.4 | 97.2 | 18.7 | 1100 | 2.2 | 1.5 | 1.7 |
| google_gemma_2_9b_it | 47.4 | 90.6 | 17.6 | 1100 | 2.2 | 1.9 | 1.2 |
| llama-3.2-3B-instruct | 44.5 | 91.8 | 16.7 | 1100 | 2.2 | 1.7 | 1.4 |
| qwen1.5-72b-chat | 40.6 | 93.4 | 14.9 | 390 | 2.2 | 1.6 | 1.5 |
| qwen1.5-32b-chat | 39.9 | 82.2 | 14.6 | 30 | 2.2 | 1.6 | 1.5 |
| qwen3-0.6b | 33.4 | 95 | 12.1 | 1100 | 2.1 | 1.4 | 1.5 |
| qwen2.5-coder-1.5b-instruct | 33.1 | 95.6 | 11.7 | 1100 | 2.1 | 1.4 | 1.6 |
| qwen1.5-14b-chat | 30.9 | 95.2 | 10.4 | 1100 | 2.1 | 1.5 | 1.5 |
| llama-3.2-1B-instruct | 26.6 | 87.8 | 8.92 | 1100 | 2 | 1.4 | 1.4 |
| google_codegemma_1.1_7b_it | 21.9 | 91.2 | 7.04 | 1100 | 1.9 | 1.3 | 1.3 |
| deepseek_v2_lite_chat | 21.4 | 93.8 | 6.98 | 1100 | 1.8 | 1.1 | 1.4 |
| qwen1.5-7b-chat | 17 | 92.2 | 5.27 | 1100 | 1.7 | 1.1 | 1.3 |
| google_gemma_3_1b_it | 14.5 | 86.6 | 6.01 | 1100 | 1.6 | 1 | 1.2 |
| mistralai_mistral_7b_instruct_v0.3 | 13.9 | 91.4 | 4.34 | 1100 | 1.5 | 0.97 | 1.2 |
| mistralai_mistral_7b_instruct_v0.2 | 10.4 | 86 | 3.22 | 920 | 1.4 | 0.79 | 1.1 |
| qwen2.5-coder-0.5b-instruct | 8.66 | 87.4 | 2.77 | 1100 | 1.3 | 0.68 | 1.1 |
| qwen2-1.5b-instruct | 7.18 | 89.4 | 2.17 | 1100 | 1.2 | 0.51 | 1 |
| mistralai_mistral_7b_instruct_v0.1 | 6.04 | 83.6 | 1.88 | 1100 | 1.1 | 0.48 | 0.95 |
| google_gemma_7b_it | 5.78 | 64.8 | 1.72 | 1100 | 1 | 0.68 | 0.79 |
| qwen2-0.5b-instruct | 3.34 | 83.2 | 1.13 | 1100 | 0.8 | 0.28 | 0.75 |
| qwen1.5-1.8b-chat | 1.53 | 73.8 | 0.539 | 1100 | 0.55 | 0.14 | 0.53 |
| qwen1.5-0.5b-chat | 0.92 | 64.8 | 0.386 | 890 | 0.43 | 0.089 | 0.42 |
| google_gemma_2b_it | 0.196 | 26.6 | 0.0683 | 1100 | 0.2 | 0.041 | 0.19 |
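Assuming the pass@count column reports pass@k evaluated at k equal to the per-question sample count, the standard unbiased estimator from the HumanEval paper (Chen et al., 2021) applies; this sketch is illustrative, and the function name is ours.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k samples, drawn without replacement from n total
    samples of which c are correct, is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# At k = n the estimator reduces to an indicator: did any sample pass?
print(pass_at_k(8, 0, 8))   # -> 0.0
print(pass_at_k(8, 1, 8))   # -> 1.0
print(pass_at_k(10, 5, 1))  # -> 0.5
```

This is consistent with pass@count staying high even for weak models: with ~1100 samples per question, a single lucky success on a question counts it as solved.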