gpqa_cot: by models

SE predicted by accuracy

This figure shows the typical standard error between pairs of models on this dataset as a function of absolute accuracy.

CDF of question-level accuracy

Results table by model

model	pass1	pass@count	win_rate	count	SE(A)	SE_x(A)	SE_pred(A)
qwen3-32b	49.4	83	31.9	11	2.4	1.7	1.6
llama-3.1-70B-instruct	43.8	43.8	27.6	13	2.3	2.3	0
google_gemma_3_27b_it	43.4	71	27.5	7	2.3	1.7	1.6
qwen3-14b	41.7	67.2	25.7	11	2.3	1.9	1.4
qwen2-math-72b-instruct	40.6	80.4	25.4	10	2.3	1.5	1.7
qwen3-8b	40.1	66.3	24.8	11	2.3	1.8	1.4
qwen2.5-coder-32b-instruct	39.5	72.3	24.3	10	2.3	1.7	1.6
qwen2-72b-instruct	39	70.1	23.5	10	2.3	1.8	1.5
google_gemma_3_12b_it	37.4	76.8	23.4	12	2.3	1.6	1.6
qwen3-4b	36.7	63.2	22.5	12	2.3	1.8	1.3
qwen2.5-coder-14b-instruct	34.6	82.1	21.3	11	2.2	1.3	1.8
qwen1.5-32b-chat	32.5	75.9	19.7	11	2.2	1.4	1.7
mistralai_mixtral_8x22b_instruct_v0.1	32.2	74.6	19.6	11	2.2	1.4	1.7
llama-3.1-8B-instruct	31.7	31.7	19.9	15	2.2	2.2	0
qwen3-1.7b	31.6	66.7	20.2	12	2.2	1.6	1.5
qwen1.5-72b-chat	31.5	67.4	19	10	2.2	1.5	1.6
deepseek_r1_distill_qwen_32b	31.5	53.6	19	10	2.2	1.8	1.3
qwen2.5-coder-7b-instruct	30.9	88.2	19.5	11	2.2	1.1	1.9
mistralai_mathstral_7b_v0.1	30.8	81.2	19	11	2.2	1.2	1.8
mistralai_ministral_8b_instruct_2410	30.8	81.5	19.3	12	2.2	1.2	1.9
google_gemma_2_9b_it	30.7	65.6	18.1	12	2.2	1.6	1.5
google_gemma_3_4b_it	30.4	72.8	19.2	13	2.2	1.5	1.6
mistralai_mixtral_8x7b_instruct_v0.1	29.8	74.1	18	11	2.2	1.3	1.7
llama-3.2-3B-instruct	29.7	29.7	18.7	17	2.2	2.2	0
deepseek_r1_distill_qwen_14b	28.6	51.6	17	11	2.1	1.7	1.3
qwen1.5-14b-chat	28.6	68.8	17.6	11	2.1	1.4	1.6
qwen2-math-7b-instruct	28.5	81.7	18.3	12	2.1	1.1	1.9
google_codegemma_1.1_7b_it	28.1	76.6	18.2	13	2.1	1.2	1.7
qwen2-7b-instruct	27.7	76.3	17.1	11	2.1	1.2	1.7
mistralai_mistral_7b_instruct_v0.3	27.4	73.7	16.8	11	2.1	1.3	1.7
qwen2.5-coder-3b-instruct	27	89.1	17.7	12	2.1	0.75	2
google_gemma_7b_it	26.7	60.9	17.1	12	2.1	1.5	1.5
deepseek_r1_distill_llama_70b	26.3	44.9	15.3	10	2.1	1.7	1.2
mistralai_mistral_7b_instruct_v0.1	25.2	83.7	16.7	11	2.1	0.85	1.9
google_gemma_3_1b_it	24.6	73.7	16.7	12	2	1	1.7
deepseek_v2_lite_chat	24.5	78.1	15.8	11	2	0.99	1.8
deepseek_r1_distill_qwen_7b	24.4	50.2	14.3	11	2	1.5	1.3
mistralai_mistral_7b_instruct_v0.2	24.1	70.8	14.8	11	2	1.2	1.7
deepseek_r1_distill_llama_8b	23.8	52.5	13.6	12	2	1.5	1.4
qwen3-0.6b	23.6	68.3	15.6	13	2	1.2	1.6
qwen2-math-1.5b-instruct	23.1	81.5	15.3	11	2	0.79	1.8
qwen1.5-7b-chat	20.4	72.5	12.9	12	1.9	0.85	1.7
qwen2.5-coder-1.5b-instruct	19.1	88.8	13.1	12	1.9	0.34	1.8
qwen2.5-coder-0.5b-instruct	18	88.2	13	13	1.8	0.36	1.8
llama-3.2-1B-instruct	15.4	15.4	10.8	12	1.7	1.7	0
qwen2-1.5b-instruct	12.5	70.3	8.35	12	1.6	0.36	1.5
google_gemma_2b_it	12.1	50	8.34	12	1.5	0.72	1.4
qwen2-0.5b-instruct	12	75	8.48	13	1.5	0.28	1.5
deepseek_r1_distill_qwen_1.5b	10.9	40.2	6.08	12	1.5	0.81	1.2
qwen1.5-0.5b-chat	8.6	63.4	6.18	13	1.3	0.2	1.3
qwen1.5-1.8b-chat	7.92	54.5	5.37	12	1.3	0.29	1.2
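
The page does not define the SE columns, so the following is only a rough sketch of one reading that appears consistent with the numbers above. It assumes that SE(A) is the binomial standard error of pass1 over the question set, and that SE_x(A) and SE_pred(A) are two components that combine in quadrature to give roughly SE(A). The question count of 448 is also an assumption (it is not stated on the page), and the helper names `binomial_se` and `combine_in_quadrature` are hypothetical.

```python
import math

# Assumption: number of questions in the dataset; not stated on the page.
N_QUESTIONS = 448


def binomial_se(acc_percent, n_questions=N_QUESTIONS):
    """Binomial standard error of an accuracy, both in percentage points."""
    p = acc_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


def combine_in_quadrature(se_x, se_pred):
    """Total standard error if the two components are independent."""
    return math.hypot(se_x, se_pred)


# (model, pass1, SE(A), SE_x(A), SE_pred(A)) copied from the table above.
rows = [
    ("qwen3-32b", 49.4, 2.4, 1.7, 1.6),
    ("llama-3.1-70B-instruct", 43.8, 2.3, 2.3, 0.0),
    ("qwen1.5-1.8b-chat", 7.92, 1.3, 0.29, 1.2),
]

for model, pass1, se_a, se_x, se_pred in rows:
    print(
        f"{model}: reported SE(A)={se_a}, "
        f"binomial={binomial_se(pass1):.2f}, "
        f"quadrature={combine_in_quadrature(se_x, se_pred):.2f}"
    )
```

Because the table values are rounded, the recomputed numbers can only be expected to match to about one decimal place. Note also that rows with SE_pred(A) equal to 0 are exactly the rows where pass@count equals pass1, so under this reading SE_x(A) carries the whole error for those models.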