ap_cot: results by model



Figure: SE predicted by accuracy. The typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
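If the plotted standard error is the usual binomial SE of a mean accuracy, the shape of the curve follows directly from the formula: it peaks at 50% accuracy and shrinks toward the extremes. A minimal sketch of that relationship (the question count `n = 700` below is an illustrative placeholder, not a figure taken from this page):

```python
import math

def accuracy_se(acc_pct: float, n: int) -> float:
    """Binomial standard error (in percentage points) of an accuracy
    measured as acc_pct percent over n independent questions."""
    p = acc_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# The SE is largest at 50% accuracy and vanishes at 0% and 100%:
# accuracy_se(50, 700) > accuracy_se(90, 700) > accuracy_se(100, 700)
```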

Figure: CDF of question-level accuracy.
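A question-level accuracy CDF of this kind can be built as a plain empirical CDF over per-question accuracies. A minimal sketch (the accuracy values are illustrative, not data from this page):

```python
def empirical_cdf(values):
    """Return (sorted_values, cdf), where cdf[i] is the fraction of
    values less than or equal to sorted_values[i]."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

# e.g. per-question accuracies averaged across models:
xs, cdf = empirical_cdf([0.2, 0.9, 0.9, 1.0])
# cdf climbs in steps of 1/n from 1/4 to 1 across the sorted accuracies
```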

Results table by model

| model | pass@1 (%) | pass@count (%) | win rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|---|
| qwen3-32b | 92 | 94.2 | 33.7 | 2 | 1 | 0.85 | 0.56 |
| qwen3-14b | 89.7 | 92.8 | 32.1 | 3 | 1.1 | 0.98 | 0.57 |
| google_gemma_3_27b_it | 88.7 | 92.5 | 31.6 | 3 | 1.2 | 0.99 | 0.65 |
| qwen2-72b-instruct | 88.7 | 92 | 31.6 | 2 | 1.2 | 0.97 | 0.67 |
| llama-3.1-70B-instruct | 87.8 | 87.8 | 31.2 | 4 | 1.2 | 1.2 | 0 |
| deepseek_r1_distill_llama_70b | 87.4 | 91.8 | 31 | 2 | 1.2 | 0.96 | 0.79 |
| qwen3-8b | 87.1 | 93 | 30.5 | 3 | 1.3 | 1 | 0.75 |
| qwen2.5-coder-32b-instruct | 85.8 | 90.4 | 29.6 | 2 | 1.3 | 1 | 0.81 |
| google_gemma_3_12b_it | 85 | 89.3 | 29 | 3 | 1.3 | 1.2 | 0.64 |
| google_gemma_2_27b_it | 83.9 | 88.5 | 28.7 | 2 | 1.4 | 1.1 | 0.8 |
| qwen3-4b | 83.4 | 90.3 | 28.8 | 4 | 1.4 | 1.2 | 0.74 |
| deepseek_r1_distill_qwen_32b | 82.8 | 88.6 | 28.6 | 2 | 1.4 | 1.1 | 0.9 |
| deepseek_r1_distill_qwen_14b | 82.4 | 91.8 | 27.9 | 4 | 1.4 | 1.1 | 0.9 |
| qwen1.5-72b-chat | 81.9 | 88.7 | 27.5 | 2 | 1.4 | 1.1 | 0.98 |
| google_gemma_2_9b_it | 80.1 | 87.1 | 26.4 | 3 | 1.5 | 1.2 | 0.84 |
| qwen2.5-coder-14b-instruct | 79.1 | 91 | 26.2 | 3 | 1.5 | 1.1 | 1.1 |
| qwen1.5-32b-chat | 79 | 85.7 | 26 | 2 | 1.5 | 1.2 | 0.96 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 77.4 | 87.1 | 27 | 2 | 1.6 | 1.1 | 1.2 |
| qwen2-math-72b-instruct | 75.1 | 86.8 | 25.2 | 2 | 1.6 | 0.99 | 1.3 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 72.1 | 84.7 | 23 | 3 | 1.7 | 1.3 | 1.1 |
| qwen3-1.7b | 71.4 | 83.8 | 23 | 4 | 1.7 | 1.4 | 0.99 |
| qwen1.5-14b-chat | 69.9 | 80.3 | 21.7 | 3 | 1.7 | 1.4 | 1 |
| llama-3.1-8B-instruct | 69.1 | 69.1 | 21.2 | 7 | 1.7 | 1.7 | 0 |
| qwen2-7b-instruct | 68.6 | 88.6 | 21.5 | 4 | 1.7 | 1.2 | 1.3 |
| google_gemma_3_4b_it | 68.3 | 83.3 | 21.4 | 5 | 1.7 | 1.4 | 1 |
| qwen2.5-coder-7b-instruct | 67 | 87.5 | 21 | 4 | 1.8 | 1.2 | 1.3 |
| mistralai_ministral_8b_instruct_2410 | 66.5 | 85.1 | 20.6 | 3 | 1.8 | 1.1 | 1.4 |
| deepseek_r1_distill_llama_8b | 60.7 | 85.1 | 19 | 4 | 1.8 | 1.1 | 1.5 |
| mistralai_mistral_7b_instruct_v0.3 | 59.9 | 79 | 18.1 | 4 | 1.8 | 1.3 | 1.3 |
| qwen1.5-7b-chat | 59.8 | 76.7 | 17.8 | 3 | 1.8 | 1.4 | 1.2 |
| qwen2-math-7b-instruct | 58.7 | 85 | 18.9 | 4 | 1.8 | 1.2 | 1.4 |
| llama-3.2-3B-instruct | 58.5 | 58.5 | 17.6 | 10 | 1.8 | 1.8 | 0 |
| deepseek_v2_lite_chat | 54.6 | 66.9 | 16.3 | 2 | 1.9 | 1.3 | 1.3 |
| mistralai_mathstral_7b_v0.1 | 52.9 | 81.9 | 15.9 | 4 | 1.9 | 1.1 | 1.5 |
| google_codegemma_1.1_7b_it | 52.3 | 76.2 | 15.5 | 5 | 1.9 | 1.4 | 1.3 |
| mistralai_mistral_7b_instruct_v0.2 | 52.2 | 74 | 15.4 | 4 | 1.9 | 1.3 | 1.3 |
| qwen2.5-coder-3b-instruct | 51.8 | 82.1 | 15.9 | 4 | 1.9 | 1.1 | 1.5 |
| deepseek_r1_distill_qwen_7b | 49.4 | 76.4 | 14.8 | 4 | 1.9 | 1.2 | 1.5 |
| google_gemma_7b_it | 47.9 | 67.7 | 14.4 | 4 | 1.9 | 1.4 | 1.3 |
| qwen3-0.6b | 47.2 | 76.8 | 14.4 | 5 | 1.9 | 1.3 | 1.4 |
| mistralai_mistral_7b_instruct_v0.1 | 41.7 | 72 | 12 | 4 | 1.8 | 1.1 | 1.5 |
| google_gemma_3_1b_it | 34.8 | 58.9 | 10.1 | 4 | 1.8 | 1.2 | 1.3 |
| google_gemma_2b_it | 32.2 | 45.4 | 9.9 | 4 | 1.8 | 1.4 | 0.99 |
| qwen2-1.5b-instruct | 31.2 | 62.6 | 9.15 | 4 | 1.7 | 0.99 | 1.4 |
| qwen2.5-coder-1.5b-instruct | 31.1 | 68.8 | 9.98 | 4 | 1.7 | 0.69 | 1.6 |
| llama-3.2-1B-instruct | 29.3 | 29.3 | 8.74 | 13 | 1.7 | 1.7 | 0 |
| qwen1.5-1.8b-chat | 24 | 46.6 | 7.02 | 3 | 1.6 | 0.81 | 1.4 |
| deepseek_r1_distill_qwen_1.5b | 21.8 | 50.4 | 6.34 | 4 | 1.5 | 0.73 | 1.4 |
| qwen2-math-1.5b-instruct | 21.8 | 52.3 | 7.79 | 4 | 1.5 | 0.71 | 1.4 |
| qwen2-0.5b-instruct | 18.5 | 56 | 6.14 | 5 | 1.5 | 0.49 | 1.4 |
| qwen2.5-coder-0.5b-instruct | 18.1 | 58.6 | 6.47 | 5 | 1.4 | 0.35 | 1.4 |
| qwen1.5-0.5b-chat | 15.1 | 48.5 | 5.34 | 5 | 1.3 | 0.5 | 1.2 |
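If pass@count is estimated per question from `count` generations, the standard unbiased pass@k estimator is the usual way to compute it. A sketch under that assumption (the page does not state which estimator was actually used):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn without replacement from n generations, of which
    c are correct, is correct. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # fewer than k incorrect generations: some draw must be correct
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A question with 1 correct generation out of 2 has pass@1 = 0.5;
# averaging this over all questions gives the table's pass@1 column.
```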