gre_physics_cot: by models



[Figure: SE predicted by accuracy. Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.]
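The predicted standard error can be reproduced, to rounding, with the standard binomial formula. A minimal sketch, assuming accuracy is a mean of i.i.d. per-question Bernoulli outcomes and a question count of roughly 75 (inferred from the SE(A) column in the table below, not stated on the page):

```python
import math

def binomial_se(accuracy_pct: float, n_questions: int) -> float:
    """Standard error of a binomial accuracy estimate, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

# qwen2-72b-instruct: 79.3% accuracy -> ~4.7 points, matching SE(A) below.
print(round(binomial_se(79.3, 75), 1))  # 4.7
```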

[Figure: CDF of question-level accuracy.]
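A minimal sketch of how such a CDF can be computed, assuming a hypothetical 0/1 correctness matrix with one row per model and one column per question (the page's actual data layout is not shown):

```python
import numpy as np

# Placeholder correctness matrix: rows = models, columns = questions.
rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=(40, 75))

# Question-level accuracy: fraction of models that answer each question correctly.
q_acc = correct.mean(axis=0)

# Empirical CDF: fraction of questions with accuracy <= x.
xs = np.sort(q_acc)
cdf = np.arange(1, xs.size + 1) / xs.size
print(list(zip(xs[::15].round(2), cdf[::15].round(2))))
```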

Results table by model. pass@1 and win rate are in percent. Across rows, the error columns appear to satisfy SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2, i.e. the overall standard error decomposes into a question-level component and a predicted sampling component. A sketch of how the two headline metrics can be computed follows the table.

| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| qwen2-72b-instruct | 79.3 | 46.4 | 2 | 4.7 | 4.4 | 1.6 |
| google_gemma_3_27b_it | 72 | 42 | 3 | 5.2 | 4.1 | 3.2 |
| qwen2-math-72b-instruct | 71.3 | 41.2 | 2 | 5.2 | 2.9 | 4.3 |
| qwen3-8b | 70.2 | 40.1 | 3 | 5.3 | 4.1 | 3.3 |
| qwen3-4b | 70 | 39.9 | 4 | 5.3 | 3.8 | 3.7 |
| qwen3-14b | 69.3 | 39 | 3 | 5.3 | 4.6 | 2.7 |
| qwen3-32b | 65.8 | 37.1 | 3 | 5.5 | 3.7 | 4 |
| qwen2.5-coder-32b-instruct | 62 | 34.8 | 2 | 5.6 | 4.3 | 3.7 |
| google_gemma_2_27b_it | 58 | 31.9 | 2 | 5.7 | 3.9 | 4.1 |
| google_gemma_3_12b_it | 57.3 | 31.9 | 4 | 5.7 | 4.1 | 3.9 |
| qwen2.5-coder-14b-instruct | 56.9 | 30.7 | 3 | 5.7 | 4.3 | 3.8 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 56 | 30.6 | 3 | 5.7 | 4 | 4.1 |
| qwen1.5-72b-chat | 55.3 | 30.3 | 2 | 5.7 | 4.6 | 3.4 |
| qwen1.5-32b-chat | 55.1 | 29.4 | 3 | 5.7 | 4.8 | 3.1 |
| qwen1.5-14b-chat | 50.7 | 27.2 | 3 | 5.8 | 4.3 | 3.8 |
| qwen2.5-coder-7b-instruct | 48 | 26.4 | 4 | 5.8 | 3.4 | 4.7 |
| qwen2-math-7b-instruct | 47.3 | 25.5 | 2 | 5.8 | 4.3 | 3.9 |
| qwen2-7b-instruct | 46 | 24.6 | 4 | 5.8 | 4.3 | 3.8 |
| qwen3-1.7b | 44.3 | 23 | 4 | 5.7 | 4.4 | 3.7 |
| google_gemma_2_9b_it | 44 | 24.6 | 4 | 5.7 | 3.4 | 4.6 |
| qwen1.5-7b-chat | 43.1 | 23.6 | 3 | 5.7 | 4.2 | 3.9 |
| mistralai_mathstral_7b_v0.1 | 39.3 | 20.7 | 4 | 5.6 | 3.7 | 4.3 |
| llama-3.1-8B-instruct | 38.7 | 20.8 | 7 | 5.6 | 5.6 | 0 |
| deepseek_v2_lite_chat | 37.8 | 21.1 | 3 | 5.6 | 3.4 | 4.4 |
| mistralai_mistral_7b_instruct_v0.3 | 37.7 | 18.9 | 4 | 5.6 | 4.4 | 3.4 |
| qwen3-0.6b | 37.3 | 20.3 | 5 | 5.6 | 3.6 | 4.2 |
| mistralai_ministral_8b_instruct_2410 | 36.7 | 19.8 | 4 | 5.6 | 2.5 | 5 |
| google_gemma_3_4b_it | 36 | 19.6 | 5 | 5.5 | 4.2 | 3.6 |
| llama-3.2-3B-instruct | 32 | 16 | 10 | 5.4 | 5.4 | 0 |
| qwen2.5-coder-3b-instruct | 31.7 | 16.5 | 4 | 5.4 | 2.9 | 4.5 |
| qwen2-math-1.5b-instruct | 29.8 | 16.2 | 3 | 5.3 | 3.3 | 4.1 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 27.6 | 15.3 | 3 | 5.2 | 3 | 4.2 |
| google_codegemma_1.1_7b_it | 26.1 | 14.2 | 5 | 5.1 | 2.8 | 4.2 |
| qwen2.5-coder-1.5b-instruct | 25 | 14.3 | 4 | 5 | 0 | 5 |
| google_gemma_7b_it | 23.7 | 11.9 | 4 | 4.9 | 3.8 | 3.1 |
| mistralai_mistral_7b_instruct_v0.2 | 23.3 | 11.5 | 4 | 4.9 | 3.2 | 3.7 |
| qwen2-1.5b-instruct | 21.7 | 11.7 | 4 | 4.8 | 2 | 4.3 |
| mistralai_mistral_7b_instruct_v0.1 | 20 | 10.5 | 4 | 4.6 | 1.9 | 4.2 |
| llama-3.2-1B-instruct | 18.7 | 10.1 | 13 | 4.5 | 4.5 | 0 |
| google_gemma_2b_it | 17.7 | 10.7 | 4 | 4.4 | 2.9 | 3.3 |
| qwen1.5-0.5b-chat | 17.3 | 11.1 | 5 | 4.4 | 1.2 | 4.2 |
| deepseek_r1_distill_llama_70b | 16.7 | 8.03 | 2 | 4.3 | 3 | 3.1 |
| qwen2.5-coder-0.5b-instruct | 16.3 | 10.6 | 5 | 4.3 | 0.17 | 4.3 |
| qwen2-0.5b-instruct | 16 | 9.53 | 5 | 4.2 | 0.57 | 4.2 |
| google_gemma_3_1b_it | 16 | 10.2 | 4 | 4.2 | 2.7 | 3.3 |
| qwen1.5-1.8b-chat | 13.8 | 7.67 | 3 | 4 | 1 | 3.8 |
| deepseek_r1_distill_qwen_32b | 12.7 | 6.6 | 2 | 3.8 | 1.2 | 3.7 |
| deepseek_r1_distill_qwen_14b | 8.67 | 4.49 | 4 | 3.2 | 0.69 | 3.2 |
| deepseek_r1_distill_qwen_7b | 7.33 | 3.17 | 4 | 3 | 1.4 | 2.7 |
| deepseek_r1_distill_llama_8b | 7 | 3.19 | 4 | 2.9 | 0.91 | 2.8 |
| deepseek_r1_distill_qwen_1.5b | 4.67 | 2.2 | 4 | 2.4 | 0.77 | 2.3 |
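
As referenced above, here is a minimal sketch of how pass@1 and a pairwise win rate can be computed. The page does not spell out its exact definitions, so both function names, the data layout, and the tie-handling below are assumptions for illustration:

```python
import numpy as np

def pass_at_1(correct: np.ndarray) -> float:
    """pass@1 in percent: mean correctness over all (question, attempt) cells.

    correct: hypothetical (n_questions, n_attempts) 0/1 array for one model.
    """
    return 100.0 * correct.mean()

def win_rate(model_acc: np.ndarray, others_acc: np.ndarray) -> float:
    """Assumed pairwise win rate in percent: fraction of (opponent, question)
    comparisons the model wins, counting ties as half a win.

    model_acc:  (n_questions,) per-question accuracy for the model.
    others_acc: (n_models, n_questions) per-question accuracies of opponents.
    """
    wins = (model_acc[None, :] > others_acc).mean()
    ties = (model_acc[None, :] == others_acc).mean()
    return 100.0 * (wins + 0.5 * ties)
```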