gpqa_cot: results by model



SE predicted by accuracy

[Figure] Typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
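A plausible model for how the standard error scales with accuracy is the binomial standard error sqrt(A(1 − A)/n). With n = 448 questions (an assumption: the size of GPQA's main split, not stated on this page), this reproduces the SE(A) column of the results table. A minimal sketch:

```python
import math

def predicted_se(accuracy_pct: float, n_questions: int = 448) -> float:
    """Binomial standard error of an accuracy, in percentage points.

    n_questions = 448 is an assumption (GPQA main split size); it is
    consistent with the SE(A) values reported in the table.
    """
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

# Spot-check against two rows of the table:
print(round(predicted_se(50.2), 1))  # 2.4  (qwen3-32b: SE(A) = 2.4)
print(round(predicted_se(1.88), 2))  # 0.64 (qwen1.5-0.5b-chat: SE(A) = 0.64)
```

The fit at both ends of the accuracy range suggests the plotted curve is this binomial SE, but that interpretation is inferred from the numbers, not documented on the page.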

[Figure] CDF of question-level accuracy.

Results table by model

All values are in percentage points. SE(A) is the standard error of pass1; SE_x(A) and SE_pred(A) are its two components, combined in quadrature (SE(A)² ≈ SE_x(A)² + SE_pred(A)²). NaN marks a component that could not be estimated.

model pass1 win_rate count SE(A) SE_x(A) SE_pred(A)
qwen3-32b 50.2 33.6 2 2.4 1.7 1.7
google_gemma_3_27b_it 47.3 31.8 1 2.4 NaN NaN
llama-3.1-70B-instruct 45.3 30.4 4 2.4 2.4 0
qwen3-14b 43.2 27.9 3 2.3 1.8 1.5
qwen2.5-coder-32b-instruct 39.7 25.5 2 2.3 1.6 1.7
qwen3-8b 39 24.8 3 2.3 1.8 1.5
qwen2-72b-instruct 36.5 23 2 2.3 1.6 1.6
qwen3-4b 35.8 22.7 4 2.3 1.8 1.3
qwen2-math-72b-instruct 35.3 22.7 2 2.3 1.1 2
google_gemma_3_12b_it 35 22.6 4 2.3 1.6 1.6
qwen1.5-32b-chat 32.3 20.2 2 2.2 1.4 1.7
qwen3-1.7b 32.1 21.1 4 2.2 1.6 1.5
qwen2.5-coder-14b-instruct 31.8 20 3 2.2 1.3 1.8
llama-3.1-8B-instruct 31.7 20.4 7 2.2 2.2 0
mistralai_mixtral_8x22b_instruct_v0.1 31.6 19.9 2 2.2 1.2 1.8
google_gemma_2_9b_it 30.6 18.8 3 2.2 1.6 1.5
mistralai_mixtral_8x7b_instruct_v0.1 30.4 19.5 3 2.2 1.3 1.8
qwen1.5-72b-chat 30.2 19 2 2.2 1.4 1.6
google_gemma_3_4b_it 30 19.6 5 2.2 1.5 1.6
mistralai_mathstral_7b_v0.1 29.7 19.3 4 2.2 0.99 1.9
qwen2-math-7b-instruct 27.9 18.9 4 2.1 0.87 1.9
qwen1.5-14b-chat 27.7 17.8 3 2.1 1.3 1.7
google_codegemma_1.1_7b_it 27.1 18.3 5 2.1 1 1.8
qwen2.5-coder-7b-instruct 26.7 17.5 4 2.1 0.9 1.9
mistralai_ministral_8b_instruct_2410 26.5 17.2 4 2.1 0.97 1.8
mistralai_mistral_7b_instruct_v0.3 26.5 16.8 4 2.1 1.2 1.7
deepseek_r1_distill_qwen_32b 26.3 15.9 2 2.1 1.7 1.3
llama-3.2-3B-instruct 26.1 17.4 10 2.1 2.1 0
google_gemma_7b_it 26.1 17.4 4 2.1 1.4 1.5
qwen2.5-coder-3b-instruct 25.7 17.8 4 2.1 0.53 2
deepseek_v2_lite_chat 24.6 16.5 2 2 0.89 1.8
google_gemma_3_1b_it 24.4 17.1 4 2 0.99 1.8
qwen2-7b-instruct 24.2 15.2 4 2 1.2 1.7
qwen3-0.6b 23.5 16.2 5 2 1.1 1.7
deepseek_r1_distill_qwen_14b 23.2 13.7 4 2 1.6 1.2
qwen1.5-7b-chat 22.9 15.2 3 2 0.82 1.8
mistralai_mistral_7b_instruct_v0.2 21.7 14.1 4 1.9 0.97 1.7
mistralai_mistral_7b_instruct_v0.1 21.3 14.8 4 1.9 0.58 1.8
deepseek_r1_distill_llama_70b 20.4 12 2 1.9 1.6 1.1
deepseek_r1_distill_llama_8b 19.8 11.5 4 1.9 1.4 1.2
deepseek_r1_distill_qwen_7b 19.7 11.5 4 1.9 1.4 1.2
qwen2-math-1.5b-instruct 17.3 12.1 4 1.8 0.54 1.7
qwen2.5-coder-0.5b-instruct 16.2 11.8 5 1.7 0 1.7
qwen2.5-coder-1.5b-instruct 16.2 11.5 4 1.7 0.22 1.7
google_gemma_2b_it 12.3 8.69 4 1.6 0.67 1.4
qwen2-1.5b-instruct 11 7.72 4 1.5 0.38 1.4
llama-3.2-1B-instruct 10.9 7.96 13 1.5 1.5 0
deepseek_r1_distill_qwen_1.5b 9.6 5.59 4 1.4 0.8 1.1
qwen1.5-1.8b-chat 8.04 5.81 3 1.3 0 1.3
qwen2-0.5b-instruct 7.81 5.68 5 1.3 0 1.3
qwen1.5-0.5b-chat 1.88 1.37 5 0.64 0.11 0.63
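The three SE columns appear to satisfy SE(A)² ≈ SE_x(A)² + SE_pred(A)², i.e. the two components combine in quadrature. This is an observation from the table values, not something the page documents; a quick check against a few rows:

```python
import math

def total_se(se_x: float, se_pred: float) -> float:
    """Combine two independent error components in quadrature."""
    return math.sqrt(se_x**2 + se_pred**2)

# (SE_x, SE_pred) pairs taken from rows of the table above:
print(round(total_se(1.7, 1.7), 1))  # 2.4 (qwen3-32b: SE(A) = 2.4)
print(round(total_se(2.4, 0.0), 1))  # 2.4 (llama-3.1-70B-instruct: SE(A) = 2.4)
print(round(total_se(1.6, 1.3), 1))  # 2.1 (deepseek_r1_distill_qwen_32b: SE(A) = 2.1)
```

Under this reading, rows with SE_pred(A) = 0 have all of their uncertainty in the question-level component, and rows with count = 1 report NaN because that component cannot be estimated from a single run.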