The table below lists the typical standard errors between pairs of models on this dataset as a function of absolute accuracy.
| model | pass@1 (%) | win_rate (%) | count | SE(A) (%) | SE_x(A) (%) | SE_pred(A) (%) |
|---|---|---|---|---|---|---|
| google_gemma_3_27b_it | 85.3 | 42.2 | 11 | 1.6 | 1.3 | 0.92 |
| deepseek_r1_distill_qwen_32b | 83.9 | 41.3 | 10 | 1.6 | 1.2 | 1.1 |
| qwen3-32b | 81.9 | 39.4 | 9 | 1.7 | 1.3 | 1.1 |
| deepseek_r1_distill_llama_70b | 81.1 | 39.4 | 10 | 1.7 | 1.3 | 1.2 |
| qwen3-14b | 81 | 38.7 | 11 | 1.8 | 1.4 | 1.1 |
| qwen2-math-72b-instruct | 80.4 | 38.1 | 7 | 1.8 | 1.5 | 1 |
| qwen3-8b | 79.2 | 37.3 | 11 | 1.8 | 1.5 | 1.1 |
| google_gemma_3_12b_it | 79 | 37.3 | 12 | 1.8 | 1.5 | 1 |
| deepseek_r1_distill_qwen_14b | 78.9 | 37.9 | 12 | 1.8 | 1.3 | 1.3 |
| qwen3-4b | 77.2 | 36 | 12 | 1.9 | 1.5 | 1.1 |
| deepseek_r1_distill_qwen_7b | 77.1 | 36.5 | 12 | 1.9 | 1.4 | 1.3 |
| qwen2.5-coder-32b-instruct | 76.1 | 35 | 10 | 1.9 | 1.5 | 1.2 |
| qwen2.5-coder-14b-instruct | 71.3 | 31.7 | 11 | 2 | 1.6 | 1.2 |
| deepseek_r1_distill_llama_8b | 70.4 | 32.7 | 12 | 2 | 1.5 | 1.4 |
| qwen2-math-1.5b-instruct | 68 | 29.7 | 1 | 2.1 | NaN | NaN |
| deepseek_r1_distill_qwen_1.5b | 67.6 | 30.1 | 12 | 2.1 | 1.5 | 1.4 |
| qwen3-1.7b | 66.7 | 29.1 | 12 | 2.1 | 1.6 | 1.3 |
| llama-3.1-70B-instruct | 65.6 | 28.7 | 12 | 2.1 | 2.1 | 0.21 |
| google_gemma_3_4b_it | 65.1 | 29.1 | 12 | 2.1 | 1.7 | 1.3 |
| qwen2-72b-instruct | 64.5 | 27.2 | 10 | 2.1 | 1.7 | 1.3 |
| qwen2.5-coder-7b-instruct | 61 | 25.7 | 12 | 2.2 | 1.6 | 1.5 |
| google_gemma_2_27b_it | 52.3 | 20.4 | 10 | 2.2 | 1.9 | 1.2 |
| qwen2-7b-instruct | 51.7 | 20.1 | 12 | 2.2 | 1.7 | 1.5 |
| mistralai_ministral_8b_instruct_2410 | 49.3 | 18.9 | 12 | 2.2 | 1.7 | 1.5 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 48.1 | 18.4 | 9 | 2.2 | 1.7 | 1.5 |
| llama-3.1-8B-instruct | 47.6 | 18 | 13 | 2.2 | 2 | 1.1 |
| mistralai_mathstral_7b_v0.1 | 47 | 17.7 | 12 | 2.2 | 1.7 | 1.5 |
| qwen2.5-coder-3b-instruct | 46.9 | 18.4 | 11 | 2.2 | 1.5 | 1.6 |
| google_gemma_2_9b_it | 46.4 | 16.9 | 11 | 2.2 | 1.9 | 1.2 |
| llama-3.2-3B-instruct | 44.1 | 16.6 | 18 | 2.2 | 1.9 | 1.1 |
| qwen1.5-72b-chat | 40.3 | 14.7 | 10 | 2.2 | 1.6 | 1.5 |
| qwen1.5-32b-chat | 38.9 | 14.1 | 9 | 2.2 | 1.6 | 1.5 |
| qwen3-0.6b | 33.7 | 12.2 | 13 | 2.1 | 1.5 | 1.5 |
| qwen2.5-coder-1.5b-instruct | 32.8 | 11.3 | 12 | 2.1 | 1.5 | 1.5 |
| qwen1.5-14b-chat | 30.5 | 10.3 | 11 | 2.1 | 1.5 | 1.4 |
| llama-3.2-1B-instruct | 27.9 | 9.67 | 11 | 2 | 2 | 0.23 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 25.2 | 8.44 | 11 | 1.9 | 1.3 | 1.4 |
| deepseek_v2_lite_chat | 22.5 | 7.25 | 9 | 1.9 | 1.2 | 1.4 |
| google_codegemma_1.1_7b_it | 20.8 | 6.55 | 12 | 1.8 | 1.3 | 1.3 |
| qwen1.5-7b-chat | 16.5 | 5.19 | 11 | 1.7 | 1.1 | 1.3 |
| google_gemma_3_1b_it | 14.5 | 5.93 | 12 | 1.6 | 1 | 1.2 |
| mistralai_mistral_7b_instruct_v0.3 | 12.9 | 3.86 | 12 | 1.5 | 0.95 | 1.2 |
| mistralai_mistral_7b_instruct_v0.2 | 10.1 | 2.98 | 12 | 1.3 | 0.82 | 1.1 |
| qwen2.5-coder-0.5b-instruct | 8.03 | 2.48 | 13 | 1.2 | 0.7 | 1 |
| qwen2-1.5b-instruct | 6.8 | 1.96 | 12 | 1.1 | 0.53 | 0.99 |
| mistralai_mistral_7b_instruct_v0.1 | 6.4 | 1.99 | 12 | 1.1 | 0.58 | 0.93 |
| google_gemma_7b_it | 5.77 | 1.68 | 12 | 1 | 0.68 | 0.79 |
| qwen2-0.5b-instruct | 2.91 | 0.985 | 13 | 0.75 | 0.31 | 0.69 |
| qwen1.5-1.8b-chat | 1.42 | 0.434 | 11 | 0.53 | 0.16 | 0.5 |
| qwen1.5-0.5b-chat | 0.8 | 0.307 | 12 | 0.4 | 0.11 | 0.38 |
| google_gemma_2b_it | 0.117 | 0.047 | 12 | 0.15 | 0 | 0.15 |
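Two relationships in the table are worth noting: the two error components appear to combine in quadrature, SE(A)² ≈ SE_x(A)² + SE_pred(A)², and SE(A) itself roughly tracks the binomial standard error √(p(1−p)/n) of the pass@1 accuracy, peaking near 50% accuracy and shrinking toward the extremes. The sketch below checks both against a few rows copied from the table; it is not the evaluation code, and `N_QUESTIONS` is an assumed round number used only to illustrate the shape of the curve, not a figure taken from the source.

```python
import math

# Minimal sanity-check sketch, not the original evaluation code.
# Assumption 1: the two components combine in quadrature,
#               SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2.
# Assumption 2: SE(A) roughly follows the binomial formula
#               sqrt(p * (1 - p) / n) for accuracy p; the number of
#               questions n is assumed here, not given in the table.
rows = {
    # model: (pass@1 %, SE(A), SE_x(A), SE_pred(A)), copied from the table
    "google_gemma_3_27b_it":        (85.3, 1.6, 1.3, 0.92),
    "deepseek_r1_distill_qwen_14b": (78.9, 1.8, 1.3, 1.3),
    "llama-3.1-70B-instruct":       (65.6, 2.1, 2.1, 0.21),
    "qwen2-0.5b-instruct":          (2.91, 0.75, 0.31, 0.69),
}

N_QUESTIONS = 500  # assumed dataset size, for illustration only


def binomial_se(pass1_percent: float, n: int) -> float:
    """Standard error of a mean of n Bernoulli trials, in percentage points."""
    p = pass1_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


for model, (pass1, se_total, se_x, se_pred) in rows.items():
    quadrature = math.hypot(se_x, se_pred)      # sqrt(SE_x^2 + SE_pred^2)
    binomial = binomial_se(pass1, N_QUESTIONS)  # depends only on accuracy
    print(f"{model:32s} SE(A)={se_total:.2f}  "
          f"quadrature={quadrature:.2f}  binomial(n={N_QUESTIONS})={binomial:.2f}")
```

Under these assumptions the reconstructed values land within rounding of the tabulated SE(A) for the sampled rows; the binomial curve explains why the standard errors are largest for models near 50% accuracy and smallest at the top and bottom of the table.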