ap_cot: by models

SE predicted by accuracy

Typical standard errors between pairs of models on this dataset, shown as a function of absolute accuracy.
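One common way to predict a standard error from accuracy alone is the binomial formula, SE = sqrt(p(1 - p) / N). The sketch below is an illustration under that assumption and is not necessarily how this page computes its curve. The function names are hypothetical, and `N_QUESTIONS` is a placeholder because the page does not state the dataset size. The pairwise version also assumes the two models' errors are independent, which real model pairs usually violate.

```python
import math

# Hypothetical dataset size, chosen only for illustration;
# the page does not state the number of questions.
N_QUESTIONS = 700


def binomial_se(accuracy, n_questions):
    """Standard error of a mean accuracy, treating each question
    as an independent Bernoulli trial with success rate `accuracy`."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def paired_difference_se(acc_a, acc_b, n_questions):
    """Standard error of the accuracy difference A - B.

    Assumption: the two models' per-question outcomes are independent,
    so their variances simply add.
    """
    se_a = binomial_se(acc_a, n_questions)
    se_b = binomial_se(acc_b, n_questions)
    return math.sqrt(se_a ** 2 + se_b ** 2)


if __name__ == "__main__":
    for acc in (0.2, 0.5, 0.9):
        se = binomial_se(acc, N_QUESTIONS)
        print(f"accuracy={acc:.1f}  SE={100 * se:.2f} percentage points")
```

Under this formula the standard error peaks at 50% accuracy and shrinks toward 0% and 100%.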

CDF of question-level accuracy

Results table by model

model	pass1	pass@count	win_rate	count	SE(A)	SE_x(A)	SE_pred(A)
qwen3-32b	92.5	97.7	31.9	11	0.99	0.81	0.56
qwen3-14b	90.3	95.6	30.3	11	1.1	0.96	0.56
qwen2-72b-instruct	89	95.6	29.6	9	1.2	0.96	0.68
llama-3.1-70B-instruct	88.7	88.7	30	11	1.2	1.2	0
google_gemma_3_27b_it	88.6	94.1	29.4	10	1.2	1	0.62
deepseek_r1_distill_llama_70b	88.6	95.9	29.8	9	1.2	0.95	0.72
deepseek_r1_distill_qwen_32b	87.2	96.5	28.9	9	1.3	0.96	0.8
qwen3-8b	86.8	95.8	28.3	11	1.3	1	0.72
qwen2.5-coder-32b-instruct	86.4	95.5	27.9	9	1.3	1	0.77
deepseek_r1_distill_qwen_14b	85.3	95.5	27.6	12	1.3	1	0.83
google_gemma_3_12b_it	85.2	93.1	27.2	11	1.3	1.1	0.7
qwen3-4b	83.3	92.3	26.7	11	1.4	1.2	0.72
google_gemma_2_27b_it	83.1	93	26.1	10	1.4	1.2	0.76
qwen1.5-72b-chat	81.5	93.8	25.4	9	1.5	1.1	0.91
google_gemma_2_9b_it	80.9	92.4	25	12	1.5	1.2	0.81
qwen2-math-72b-instruct	80.5	94.8	25.3	9	1.5	1.1	0.99
qwen2.5-coder-14b-instruct	80.5	94.1	24.8	11	1.5	1.2	0.94
qwen1.5-32b-chat	79.1	94.1	24.3	11	1.5	1.2	0.97
mistralai_mixtral_8x22b_instruct_v0.1	76.8	94.4	24.9	11	1.6	1.1	1.2
llama-3.1-8B-instruct	73.4	73.4	21.8	15	1.7	1.7	0
qwen2-7b-instruct	72.6	93.7	21.4	12	1.7	1.2	1.1
qwen1.5-14b-chat	71.8	89	20.7	11	1.7	1.4	0.98
mistralai_mixtral_8x7b_instruct_v0.1	71	92.4	20.9	11	1.7	1.3	1.1
qwen2.5-coder-7b-instruct	70.9	94.7	20.7	12	1.7	1.2	1.2
qwen3-1.7b	70.8	87.6	20.9	11	1.7	1.4	0.99
mistralai_ministral_8b_instruct_2410	69.2	95.8	20.1	12	1.7	1.2	1.3
google_gemma_3_4b_it	68.5	86.5	19.7	13	1.7	1.5	0.96
deepseek_r1_distill_llama_8b	63.2	92.5	18.2	12	1.8	1.2	1.4
qwen2-math-7b-instruct	61.9	92.7	18.6	11	1.8	1.2	1.3
qwen1.5-7b-chat	61.5	90	17.1	12	1.8	1.4	1.2
llama-3.2-3B-instruct	60.3	60.3	16.4	18	1.8	1.8	0
mistralai_mistral_7b_instruct_v0.3	59.7	86.9	16.5	12	1.8	1.4	1.2
mistralai_mathstral_7b_v0.1	57.6	92.3	15.9	12	1.9	1.2	1.4
deepseek_v2_lite_chat	57.2	90.3	15.8	11	1.9	1.3	1.3
qwen2.5-coder-3b-instruct	55.3	93	15.2	11	1.9	1.2	1.4
deepseek_r1_distill_qwen_7b	54.4	88.6	15.5	12	1.9	1.2	1.4
google_codegemma_1.1_7b_it	52.9	83.4	14.2	13	1.9	1.4	1.2
mistralai_mistral_7b_instruct_v0.2	52.3	80.7	14.2	12	1.9	1.4	1.3
mistralai_mistral_7b_instruct_v0.1	49	88.9	13.1	12	1.9	1.2	1.4
google_gemma_7b_it	48.1	77.4	13.1	12	1.9	1.4	1.2
qwen3-0.6b	47.2	82.3	12.9	12	1.9	1.4	1.3
qwen2-1.5b-instruct	39.2	84.1	10.4	11	1.8	1.2	1.4
google_gemma_3_1b_it	35.4	71.3	9.39	12	1.8	1.3	1.2
qwen2-math-1.5b-instruct	34.3	85.7	10.7	11	1.8	0.95	1.5
qwen2.5-coder-1.5b-instruct	33.9	88.6	9.75	11	1.8	0.81	1.6
google_gemma_2b_it	32.6	50.9	9.28	12	1.8	1.5	0.88
llama-3.2-1B-instruct	31.9	31.9	8.67	21	1.7	1.7	0
qwen1.5-1.8b-chat	30.5	79.5	8.29	12	1.7	0.98	1.4
deepseek_r1_distill_qwen_1.5b	26.5	77.6	7.34	12	1.7	0.81	1.4
qwen2-0.5b-instruct	22.3	78.1	7.08	13	1.6	0.79	1.3
qwen2.5-coder-0.5b-instruct	19.7	85.9	6.82	12	1.5	0.39	1.4
qwen1.5-0.5b-chat	19.7	73.4	6.23	12	1.5	0.68	1.3
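
The page does not define the three SE columns, but the reported values look consistent with a variance decomposition of the form SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2. That reading is an inference from the numbers, not a documented definition. The sketch below checks it on a few rows copied from the table. The helper names are hypothetical, and the 0.1 tolerance is an assumption chosen to allow for the table's rounding.

```python
import math

# (model, SE(A), SE_x(A), SE_pred(A)) copied from the results table.
ROWS = [
    ("qwen3-32b", 0.99, 0.81, 0.56),
    ("llama-3.1-70B-instruct", 1.2, 1.2, 0.0),
    ("qwen2-7b-instruct", 1.7, 1.2, 1.1),
    ("qwen2.5-coder-0.5b-instruct", 1.5, 0.39, 1.4),
]


def combined_se(se_x, se_pred):
    """Combine two standard errors in quadrature.

    Assumption: the two components are uncorrelated, so their
    variances add.
    """
    return math.hypot(se_x, se_pred)


def max_gap(rows):
    """Largest absolute gap between the reported SE(A) and the
    quadrature sum of SE_x(A) and SE_pred(A)."""
    return max(abs(total - combined_se(se_x, se_pred))
               for _, total, se_x, se_pred in rows)


if __name__ == "__main__":
    for model, total, se_x, se_pred in ROWS:
        combo = combined_se(se_x, se_pred)
        print(f"{model}: reported {total}, combined {combo:.2f}")
    print(f"largest gap: {max_gap(ROWS):.2f}")
```

On this reading, the llama rows, where pass1 equals pass@count and SE_pred(A) is 0, have SE(A) equal to SE_x(A).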