The table below shows the typical standard errors between pairs of models on this dataset as a function of absolute accuracy.
| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| qwen3-32b | 80.1 | 33.7 | 10 | 2.0 | 1.5 | 1.3 |
| qwen3-14b | 77.0 | 31.6 | 11 | 2.1 | 1.7 | 1.2 |
| llama-3.1-70B-instruct | 75.2 | 30.2 | 13 | 2.2 | 2.2 | 0.0 |
| deepseek_r1_distill_llama_70b | 74.9 | 29.9 | 9 | 2.2 | 1.8 | 1.2 |
| qwen2-72b-instruct | 73.6 | 29.4 | 8 | 2.2 | 2.0 | 0.82 |
| qwen3-8b | 72.2 | 28.9 | 11 | 2.2 | 1.7 | 1.4 |
| deepseek_r1_distill_qwen_32b | 71.3 | 28.0 | 9 | 2.3 | 1.8 | 1.4 |
| qwen2.5-coder-32b-instruct | 70.4 | 27.3 | 10 | 2.3 | 1.8 | 1.4 |
| qwen3-4b | 70.0 | 27.9 | 12 | 2.3 | 1.8 | 1.4 |
| qwen1.5-72b-chat | 67.5 | 25.2 | 1 | 2.3 | NaN | NaN |
| google_gemma_3_12b_it | 67.0 | 25.5 | 10 | 2.3 | 2.0 | 1.3 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 66.8 | 25.0 | 11 | 2.3 | 2.0 | 1.2 |
| deepseek_r1_distill_qwen_14b | 66.7 | 25.5 | 11 | 2.3 | 1.8 | 1.5 |
| qwen2-math-72b-instruct | 66.7 | 25.1 | 7 | 2.3 | 1.9 | 1.3 |
| qwen1.5-32b-chat | 66.5 | 24.5 | 1 | 2.4 | NaN | NaN |
| qwen2.5-coder-14b-instruct | 63.2 | 22.8 | 11 | 2.4 | 2.0 | 1.3 |
| qwen2-7b-instruct | 58.3 | 20.6 | 12 | 2.5 | 2.1 | 1.2 |
| qwen1.5-14b-chat | 57.5 | 20.0 | 11 | 2.5 | 2.2 | 1.1 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 57.4 | 20.0 | 11 | 2.5 | 2.3 | 0.99 |
| qwen2.5-coder-7b-instruct | 56.2 | 19.7 | 12 | 2.5 | 2.1 | 1.4 |
| llama-3.1-8B-instruct | 53.6 | 18.2 | 15 | 2.5 | 2.5 | 0.0 |
| qwen3-1.7b | 51.2 | 18.2 | 12 | 2.5 | 1.9 | 1.6 |
| mistralai_ministral_8b_instruct_2410 | 51.0 | 17.4 | 12 | 2.5 | 1.8 | 1.7 |
| qwen1.5-7b-chat | 50.5 | 16.8 | 12 | 2.5 | 2.2 | 1.2 |
| mistralai_mathstral_7b_v0.1 | 49.4 | 16.7 | 11 | 2.5 | 1.9 | 1.6 |
| google_gemma_3_4b_it | 48.6 | 17.0 | 13 | 2.5 | 2.0 | 1.5 |
| mistralai_mistral_7b_instruct_v0.3 | 48.0 | 15.2 | 11 | 2.5 | 2.2 | 1.2 |
| mistralai_mistral_7b_instruct_v0.2 | 47.6 | 16.1 | 11 | 2.5 | 2.4 | 0.72 |
| deepseek_v2_lite_chat | 47.5 | 16.4 | 11 | 2.5 | 2.0 | 1.5 |
| llama-3.2-3B-instruct | 46.9 | 16.0 | 15 | 2.5 | 2.5 | 0.0 |
| deepseek_r1_distill_llama_8b | 45.1 | 15.4 | 12 | 2.5 | 1.6 | 1.9 |
| deepseek_r1_distill_qwen_7b | 44.1 | 15.3 | 11 | 2.5 | 1.7 | 1.8 |
| qwen2-math-7b-instruct | 43.8 | 15.3 | 12 | 2.5 | 1.9 | 1.5 |
| qwen2.5-coder-3b-instruct | 40.2 | 14.0 | 12 | 2.4 | 1.5 | 1.9 |
| mistralai_mistral_7b_instruct_v0.1 | 37.4 | 12.7 | 11 | 2.4 | 1.5 | 1.8 |
| qwen3-0.6b | 37.0 | 13.7 | 13 | 2.4 | 2.0 | 1.4 |
| qwen2-1.5b-instruct | 34.6 | 11.7 | 12 | 2.4 | 1.7 | 1.7 |
| qwen1.5-1.8b-chat | 27.6 | 10.3 | 12 | 2.2 | 1.4 | 1.7 |
| qwen2.5-coder-1.5b-instruct | 27.4 | 10.3 | 12 | 2.2 | 1.1 | 2.0 |
| llama-3.2-1B-instruct | 26.3 | 9.65 | 11 | 2.2 | 2.2 | 0.0 |
| qwen2-math-1.5b-instruct | 24.3 | 10.5 | 11 | 2.1 | 1.1 | 1.8 |
| deepseek_r1_distill_qwen_1.5b | 24.2 | 9.21 | 12 | 2.1 | 0.98 | 1.9 |
| qwen2-0.5b-instruct | 22.7 | 10.0 | 13 | 2.1 | 0.84 | 1.9 |
| qwen1.5-0.5b-chat | 21.7 | 9.62 | 13 | 2.1 | 0.70 | 1.9 |
| qwen2.5-coder-0.5b-instruct | 20.5 | 9.33 | 13 | 2.0 | 0.33 | 2.0 |
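Two patterns in the SE columns are worth noting. First, SE(A) tracks the binomial standard error sqrt(p(1 - p)/n) of the pass rate, which is largest near 50% accuracy and shrinks toward the extremes; the values are consistent with an eval of roughly n ≈ 400 questions (an inference from the numbers, not stated in the source). Second, SE_x(A) and SE_pred(A) appear to combine in quadrature to give SE(A). A minimal sketch, under those assumptions:

```python
import math

def binomial_se(pass_rate_pct: float, n_questions: int = 400) -> float:
    """Standard error of a pass rate, in percentage points.

    n_questions = 400 is an assumption inferred from the table,
    not a figure stated in the source.
    """
    p = pass_rate_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

# Spot-check against the table: qwen3-32b, 80.1% pass@1, SE(A) = 2.0.
print(round(binomial_se(80.1), 1))  # 2.0

# The two components seem to add in quadrature:
#   SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2
se_x, se_pred = 1.5, 1.3  # qwen3-32b row
print(round(math.hypot(se_x, se_pred), 1))  # 2.0
```

The quadrature relation holds across rows (e.g. deepseek_r1_distill_llama_8b: sqrt(1.6² + 1.9²) ≈ 2.5), which suggests SE_x and SE_pred are independent variance components of the total SE.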