gre_physics_cot: by model



[Figure: SE predicted by accuracy. Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.]
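
The SE(A) column in the results table below is consistent with a simple binomial error model for accuracy. A minimal sketch of how a predicted SE can be computed from accuracy alone, assuming SE = 100 * sqrt(p(1-p)/n) with roughly n = 74 questions (a count inferred from the reported SE(A) values, not documented on this page):

```python
import math

def predicted_se(acc_pct: float, n_questions: int = 74) -> float:
    """Predicted standard error of an accuracy estimate, in percentage
    points, under a binomial model: 100 * sqrt(p * (1 - p) / n).

    n_questions = 74 is an assumption inferred from the SE(A) column
    of the results table, not a documented property of the dataset.
    """
    p = acc_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

# Spot check against the table: predicted_se(78.3) ~= 4.8, matching
# the reported SE(A) for qwen2-math-72b-instruct.
```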

[Figure: CDF of question-level accuracy.]
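
For reference, a minimal sketch of how such a CDF could be computed, assuming "question-level accuracy" means each question's pass rate averaged over models; the array layout below is illustrative, not this page's actual pipeline:

```python
import numpy as np

def question_accuracy_cdf(correct: np.ndarray):
    """Empirical CDF of question-level accuracy.

    `correct` is a hypothetical (n_models, n_questions) 0/1 array of
    graded answers; a question's accuracy is its mean over models.
    Returns sorted accuracies and their cumulative fractions.
    """
    q_acc = correct.mean(axis=0)               # accuracy per question
    xs = np.sort(q_acc)                        # x-axis: accuracy
    ys = np.arange(1, xs.size + 1) / xs.size   # y-axis: P(acc <= x)
    return xs, ys
```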

Results table by model

model                                   pass@1 (%)  win rate (%)  count  SE(A)  SE_x(A)  SE_pred(A)
qwen2-math-72b-instruct                       78.3          43.7     11    4.8      3.7         3.0
qwen2-72b-instruct                            77.3          43.2     11    4.8      4.6         1.6
google_gemma_3_27b_it                         74.4          42.2     12    5.0      3.9         3.1
qwen3-8b                                      72.0          39.2     12    5.2      4.3         2.9
qwen3-4b                                      71.9          39.7     12    5.2      4.1         3.2
qwen3-14b                                     71.1          38.7     12    5.2      4.6         2.5
qwen3-32b                                     67.9          37.6     11    5.4      4.1         3.5
qwen2.5-coder-32b-instruct                    62.4          33.2     11    5.6      4.3         3.6
qwen1.5-72b-chat                              60.1          32.3     11    5.7      5.1         2.5
google_gemma_2_27b_it                         59.2          32.3     10    5.7      3.6         4.4
qwen1.5-32b-chat                              58.4          29.9     11    5.7      5.1         2.5
mistralai_mixtral_8x22b_instruct_v0.1         58.1          29.8     11    5.7      4.5         3.5
qwen2.5-coder-14b-instruct                    56.2          28.6     12    5.7      4.7         3.3
qwen2.5-coder-7b-instruct                     55.2          28.5     10    5.7      4.3         3.8
google_gemma_3_12b_it                         54.3          28.6     11    5.8      4.4         3.7
qwen1.5-14b-chat                              52.7          26.5     12    5.8      4.6         3.5
qwen2-math-7b-instruct                        51.8          26.8      6    5.8      4.7         3.3
qwen3-1.7b                                    48.2          24.2     12    5.8      4.3         3.8
qwen1.5-7b-chat                               46.3          24.1     12    5.8      4.8         3.2
qwen2-7b-instruct                             46.1          23.2     11    5.8      4.6         3.5
mistralai_mathstral_7b_v0.1                   45.6          23.3     11    5.8      4.1         4.0
mistralai_mistral_7b_instruct_v0.3            41.0          19.5     11    5.7      4.9         2.8
qwen3-0.6b                                    39.8          19.9     13    5.7      4.5         3.4
deepseek_v2_lite_chat                         39.2          19.9     11    5.6      3.7         4.3
google_gemma_2_9b_it                          38.8          20.0     11    5.6      3.8         4.1
mistralai_ministral_8b_instruct_2410          38.3          18.9     11    5.6      3.3         4.5
qwen2-math-1.5b-instruct                      37.3          19.6      4    5.6      4.0         3.9
llama-3.2-3B-instruct                         37.3          17.5     17    5.6      5.6         0.0
llama-3.1-8B-instruct                         37.3          17.9     15    5.6      5.6         0.0
google_gemma_3_4b_it                          37.0          19.2     13    5.6      4.4         3.4
qwen2.5-coder-3b-instruct                     35.3          17.6     12    5.5      3.2         4.5
mistralai_mixtral_8x7b_instruct_v0.1          29.8          15.2     12    5.3      3.4         4.0
qwen2.5-coder-1.5b-instruct                   25.6          13.8     11    5.0      1.3         4.9
google_gemma_7b_it                            24.4          11.3     13    5.0      4.1         2.8
qwen2-1.5b-instruct                           24.3          12.3     13    5.0      2.9         4.0
mistralai_mistral_7b_instruct_v0.2            24.0          10.8     10    4.9      3.6         3.4
google_codegemma_1.1_7b_it                    23.4          11.4     13    4.9      3.3         3.6
mistralai_mistral_7b_instruct_v0.1            20.0          10.2     11    4.6      2.3         4.0
qwen1.5-0.5b-chat                             19.5          11.4     13    4.6      1.2         4.4
llama-3.2-1B-instruct                         18.7          9.57     12    4.5      4.5         0.0
qwen1.5-1.8b-chat                             18.3          9.79     11    4.5      1.5         4.2
qwen2.5-coder-0.5b-instruct                   18.3          11.1     13    4.5      1.1         4.3
deepseek_r1_distill_llama_70b                 18.2          8.64     11    4.5      2.5         3.7
qwen2-0.5b-instruct                           18.2          10.6     13    4.5      1.2         4.3
google_gemma_3_1b_it                          17.9          11.2     13    4.4      3.2         3.1
google_gemma_2b_it                            17.1          9.38     13    4.4      3.2         2.9
deepseek_r1_distill_qwen_14b                  14.7          6.69     11    4.1      1.9         3.6
deepseek_r1_distill_qwen_32b                  12.8          5.98     11    3.9      1.6         3.5
deepseek_r1_distill_llama_8b                  11.3          4.77     13    3.7      1.8         3.2
deepseek_r1_distill_qwen_7b                   6.79          3.09     11    2.9      0.9         2.8
deepseek_r1_distill_qwen_1.5b                  4.0           1.7     13    2.3     0.49         2.2
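
Across rows, the three SE columns appear to combine in quadrature, SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2; this decomposition is inferred from the numbers themselves rather than stated anywhere on the page. A quick spot check for the top row:

```python
import math

# qwen2-math-72b-instruct: SE_x(A) = 3.7, SE_pred(A) = 3.0
se_x, se_pred = 3.7, 3.0
print(math.hypot(se_x, se_pred))  # 4.76..., vs. reported SE(A) = 4.8
```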