The table below shows the typical standard errors between pairs of models on this dataset as a function of absolute accuracy. SE(A) is the total standard error of a model's accuracy, and SE_x(A) and SE_pred(A) are its question-level and prediction-level components; a sketch of how such a decomposition can be estimated follows the table.
| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| qwen3-32b | 92 | 33.7 | 2 | 1 | 0.85 | 0.56 |
| qwen3-14b | 89.7 | 32.1 | 3 | 1.1 | 0.98 | 0.57 |
| google_gemma_3_27b_it | 88.7 | 31.6 | 3 | 1.2 | 0.99 | 0.65 |
| qwen2-72b-instruct | 88.7 | 31.6 | 2 | 1.2 | 0.97 | 0.67 |
| llama-3.1-70B-instruct | 87.8 | 31.2 | 4 | 1.2 | 1.2 | 0 |
| deepseek_r1_distill_llama_70b | 87.4 | 31 | 2 | 1.2 | 0.96 | 0.79 |
| qwen3-8b | 87.1 | 30.5 | 3 | 1.3 | 1 | 0.75 |
| qwen2.5-coder-32b-instruct | 85.8 | 29.6 | 2 | 1.3 | 1 | 0.81 |
| google_gemma_3_12b_it | 85 | 29 | 3 | 1.3 | 1.2 | 0.64 |
| google_gemma_2_27b_it | 83.9 | 28.7 | 2 | 1.4 | 1.1 | 0.8 |
| qwen3-4b | 83.4 | 28.8 | 4 | 1.4 | 1.2 | 0.74 |
| deepseek_r1_distill_qwen_32b | 82.8 | 28.6 | 2 | 1.4 | 1.1 | 0.9 |
| deepseek_r1_distill_qwen_14b | 82.4 | 27.9 | 4 | 1.4 | 1.1 | 0.9 |
| qwen1.5-72b-chat | 81.9 | 27.5 | 2 | 1.4 | 1.1 | 0.98 |
| google_gemma_2_9b_it | 80.1 | 26.4 | 3 | 1.5 | 1.2 | 0.84 |
| qwen2.5-coder-14b-instruct | 79.1 | 26.2 | 3 | 1.5 | 1.1 | 1.1 |
| qwen1.5-32b-chat | 79 | 26 | 2 | 1.5 | 1.2 | 0.96 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 77.4 | 27 | 2 | 1.6 | 1.1 | 1.2 |
| qwen2-math-72b-instruct | 75.1 | 25.2 | 2 | 1.6 | 0.99 | 1.3 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 72.1 | 23 | 3 | 1.7 | 1.3 | 1.1 |
| qwen3-1.7b | 71.4 | 23 | 4 | 1.7 | 1.4 | 0.99 |
| qwen1.5-14b-chat | 69.9 | 21.7 | 3 | 1.7 | 1.4 | 1 |
| llama-3.1-8B-instruct | 69.1 | 21.2 | 7 | 1.7 | 1.7 | 0 |
| qwen2-7b-instruct | 68.6 | 21.5 | 4 | 1.7 | 1.2 | 1.3 |
| google_gemma_3_4b_it | 68.3 | 21.4 | 5 | 1.7 | 1.4 | 1 |
| qwen2.5-coder-7b-instruct | 67 | 21 | 4 | 1.8 | 1.2 | 1.3 |
| mistralai_ministral_8b_instruct_2410 | 66.5 | 20.6 | 3 | 1.8 | 1.1 | 1.4 |
| deepseek_r1_distill_llama_8b | 60.7 | 19 | 4 | 1.8 | 1.1 | 1.5 |
| mistralai_mistral_7b_instruct_v0.3 | 59.9 | 18.1 | 4 | 1.8 | 1.3 | 1.3 |
| qwen1.5-7b-chat | 59.8 | 17.8 | 3 | 1.8 | 1.4 | 1.2 |
| qwen2-math-7b-instruct | 58.7 | 18.9 | 4 | 1.8 | 1.2 | 1.4 |
| llama-3.2-3B-instruct | 58.5 | 17.6 | 10 | 1.8 | 1.8 | 0 |
| deepseek_v2_lite_chat | 54.6 | 16.3 | 2 | 1.9 | 1.3 | 1.3 |
| mistralai_mathstral_7b_v0.1 | 52.9 | 15.9 | 4 | 1.9 | 1.1 | 1.5 |
| google_codegemma_1.1_7b_it | 52.3 | 15.5 | 5 | 1.9 | 1.4 | 1.3 |
| mistralai_mistral_7b_instruct_v0.2 | 52.2 | 15.4 | 4 | 1.9 | 1.3 | 1.3 |
| qwen2.5-coder-3b-instruct | 51.8 | 15.9 | 4 | 1.9 | 1.1 | 1.5 |
| deepseek_r1_distill_qwen_7b | 49.4 | 14.8 | 4 | 1.9 | 1.2 | 1.5 |
| google_gemma_7b_it | 47.9 | 14.4 | 4 | 1.9 | 1.4 | 1.3 |
| qwen3-0.6b | 47.2 | 14.4 | 5 | 1.9 | 1.3 | 1.4 |
| mistralai_mistral_7b_instruct_v0.1 | 41.7 | 12 | 4 | 1.8 | 1.1 | 1.5 |
| google_gemma_3_1b_it | 34.8 | 10.1 | 4 | 1.8 | 1.2 | 1.3 |
| google_gemma_2b_it | 32.2 | 9.9 | 4 | 1.8 | 1.4 | 0.99 |
| qwen2-1.5b-instruct | 31.2 | 9.15 | 4 | 1.7 | 0.99 | 1.4 |
| qwen2.5-coder-1.5b-instruct | 31.1 | 9.98 | 4 | 1.7 | 0.69 | 1.6 |
| llama-3.2-1B-instruct | 29.3 | 8.74 | 13 | 1.7 | 1.7 | 0 |
| qwen1.5-1.8b-chat | 24 | 7.02 | 3 | 1.6 | 0.81 | 1.4 |
| deepseek_r1_distill_qwen_1.5b | 21.8 | 6.34 | 4 | 1.5 | 0.73 | 1.4 |
| qwen2-math-1.5b-instruct | 21.8 | 7.79 | 4 | 1.5 | 0.71 | 1.4 |
| qwen2-0.5b-instruct | 18.5 | 6.14 | 5 | 1.5 | 0.49 | 1.4 |
| qwen2.5-coder-0.5b-instruct | 18.1 | 6.47 | 5 | 1.4 | 0.35 | 1.4 |
| qwen1.5-0.5b-chat | 15.1 | 5.34 | 5 | 1.3 | 0.5 | 1.2 |
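
To make the SE columns concrete, here is a minimal sketch of one way such a decomposition can be estimated, assuming `count` is the number of resamples `k` drawn per question and that the split follows the law of total variance. The exact estimator behind the table may differ, and the function name `se_components` is our own invention for illustration.

```python
import numpy as np

def se_components(scores: np.ndarray) -> dict[str, float]:
    """Split the standard error of mean accuracy into question-level and
    prediction-level parts via the law of total variance.

    scores: (n_questions, k) array of 0/1 correctness over k resamples
    of each question. Returns values on the 0-100 scale used in the table.
    """
    n, k = scores.shape
    p_hat = scores.mean(axis=1)  # per-question pass rate

    # Total SE of the mean accuracy: questions are the independent
    # sampling unit, so take the variance across questions.
    se_total = np.sqrt(p_hat.var(ddof=1) / n)

    # Prediction component: noise from drawing only k generations per
    # question; the k / (k - 1) factor debiases the plug-in p(1 - p)
    # estimate. With k == 1 the components are not identifiable, so
    # everything is attributed to the question-level term.
    within = (p_hat * (1 - p_hat)).mean() * k / (k - 1) if k > 1 else 0.0
    se_pred = np.sqrt(within / (n * k))

    # Question component: whatever total variance the resampling noise
    # does not explain, clamped at zero against estimation error.
    se_x = np.sqrt(max(se_total**2 - se_pred**2, 0.0))

    return {
        "SE(A)": 100 * se_total,
        "SE_x(A)": 100 * se_x,
        "SE_pred(A)": 100 * se_pred,
    }
```

Under this reading, deterministic decoding makes every resample of a question agree, so the plug-in p(1 - p) term is exactly zero; that would explain the llama-* rows, where SE_pred(A) is 0 and SE_x(A) coincides with SE(A).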