Typical standard errors on this dataset as a function of absolute accuracy, relevant when comparing pairs of models. For each model the table lists its pass@1 accuracy, win rate, count, the total standard error SE(A), and its two components SE_x(A) and SE_pred(A); across rows, SE(A)² ≈ SE_x(A)² + SE_pred(A)².
| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| qwen3-32b | 78.5 | 35.4 | 2 | 2.0 | 1.5 | 1.3 |
| qwen3-14b | 75.2 | 32.8 | 3 | 2.2 | 1.8 | 1.2 |
| qwen2-72b-instruct | 74.7 | 33.0 | 1 | 2.2 | NaN | NaN |
| llama-3.1-70B-instruct | 73.9 | 32.6 | 4 | 2.2 | 2.2 | 0.0 |
| qwen3-8b | 72.1 | 31.0 | 3 | 2.2 | 1.7 | 1.4 |
| deepseek_r1_distill_llama_70b | 72.0 | 30.5 | 2 | 2.2 | 2.0 | 1.1 |
| qwen3-4b | 69.9 | 29.9 | 4 | 2.3 | 1.8 | 1.3 |
| deepseek_r1_distill_qwen_32b | 68.7 | 28.8 | 2 | 2.3 | 1.9 | 1.3 |
| qwen2.5-coder-32b-instruct | 68.6 | 28.8 | 2 | 2.3 | 1.9 | 1.4 |
| google_gemma_3_12b_it | 68.2 | 28.5 | 2 | 2.3 | 1.9 | 1.3 |
| deepseek_r1_distill_qwen_14b | 64.1 | 25.9 | 4 | 2.4 | 1.8 | 1.5 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 63.8 | 25.7 | 2 | 2.4 | 2.0 | 1.4 |
| qwen2.5-coder-14b-instruct | 61.0 | 24.3 | 3 | 2.4 | 1.9 | 1.5 |
| qwen1.5-14b-chat | 57.6 | 22.4 | 3 | 2.5 | 2.1 | 1.2 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 57.5 | 22.1 | 3 | 2.5 | 2.2 | 1.0 |
| qwen2-7b-instruct | 56.5 | 21.9 | 4 | 2.5 | 2.0 | 1.4 |
| qwen2.5-coder-7b-instruct | 53.4 | 20.3 | 4 | 2.5 | 2.0 | 1.5 |
| llama-3.1-8B-instruct | 51.9 | 20.0 | 7 | 2.5 | 2.5 | 0.0 |
| qwen3-1.7b | 50.7 | 20.1 | 4 | 2.5 | 1.8 | 1.7 |
| qwen1.5-7b-chat | 49.1 | 18.3 | 3 | 2.5 | 2.1 | 1.4 |
| mistralai_mistral_7b_instruct_v0.3 | 48.9 | 17.8 | 4 | 2.5 | 2.1 | 1.3 |
| google_gemma_3_4b_it | 47.3 | 18.1 | 5 | 2.5 | 2.0 | 1.5 |
| mistralai_mistral_7b_instruct_v0.2 | 47.0 | 17.5 | 4 | 2.5 | 2.3 | 0.92 |
| mistralai_mathstral_7b_v0.1 | 46.5 | 18.1 | 4 | 2.5 | 1.7 | 1.8 |
| mistralai_ministral_8b_instruct_2410 | 46.0 | 17.5 | 4 | 2.5 | 1.5 | 2.0 |
| deepseek_v2_lite_chat | 44.5 | 17.1 | 2 | 2.5 | 1.8 | 1.7 |
| deepseek_r1_distill_llama_8b | 43.9 | 16.3 | 4 | 2.5 | 1.5 | 1.9 |
| llama-3.2-3B-instruct | 43.9 | 16.7 | 8 | 2.5 | 2.5 | 0.0 |
| deepseek_r1_distill_qwen_7b | 41.4 | 15.7 | 4 | 2.5 | 1.5 | 1.9 |
| qwen2-math-7b-instruct | 40.5 | 15.9 | 3 | 2.4 | 1.7 | 1.8 |
| qwen2.5-coder-3b-instruct | 35.7 | 13.7 | 4 | 2.4 | 1.3 | 2.0 |
| qwen3-0.6b | 35.0 | 13.7 | 5 | 2.4 | 1.8 | 1.5 |
| mistralai_mistral_7b_instruct_v0.1 | 34.0 | 13.0 | 4 | 2.4 | 1.3 | 1.9 |
| qwen2-1.5b-instruct | 32.3 | 12.3 | 4 | 2.3 | 1.4 | 1.9 |
| llama-3.2-1B-instruct | 27.0 | 11.3 | 11 | 2.2 | 2.2 | 0.0 |
| qwen2.5-coder-1.5b-instruct | 25.8 | 11.3 | 4 | 2.2 | 0.85 | 2.0 |
| qwen1.5-1.8b-chat | 25.8 | 10.6 | 3 | 2.2 | 1.1 | 1.9 |
| qwen2-math-1.5b-instruct | 22.0 | 9.81 | 3 | 2.1 | 0.98 | 1.8 |
| qwen2.5-coder-0.5b-instruct | 20.3 | 9.97 | 5 | 2.0 | 0.077 | 2.0 |
| deepseek_r1_distill_qwen_1.5b | 20.3 | 8.72 | 4 | 2.0 | 0.79 | 1.8 |
| qwen2-0.5b-instruct | 19.8 | 9.09 | 5 | 2.0 | 0.6 | 1.9 |
| qwen1.5-0.5b-chat | 19.2 | 9.3 | 5 | 2.0 | 0.42 | 1.9 |
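As a reading aid, here is a minimal sketch of how such a decomposition is commonly computed, not necessarily the code behind the table above. It assumes `count` is the number of completions sampled per question, and that SE_x(A) and SE_pred(A) are the question-sampling and completion-resampling components of a clustered standard error; the function name and the toy data below are illustrative only.

```python
import numpy as np

def eval_standard_errors(scores: np.ndarray) -> tuple[float, float, float]:
    """Decompose the standard error of a mean accuracy (pass@1) estimate.

    `scores` is an (n_questions, k) array of 0/1 outcomes: k independent
    completions per question. Returns (SE, SE_x, SE_pred) as fractions,
    with SE^2 = SE_x^2 + SE_pred^2.
    """
    n, k = scores.shape
    per_question = scores.mean(axis=1)                 # pass@1 per question
    # Total (clustered) standard error: one cluster per question.
    se_total = per_question.std(ddof=1) / np.sqrt(n)
    # Completion-resampling component: within-question variance, averaged
    # over questions and shrunk by the k completions (NaN when k == 1).
    se_pred = np.sqrt(scores.var(axis=1, ddof=1).mean() / (n * k))
    # Question-sampling component: the remainder of the total variance.
    se_x = np.sqrt(max(se_total**2 - se_pred**2, 0.0))
    return se_total, se_x, se_pred

# Toy usage: 400 questions, 4 completions each, ~50% accuracy with some
# per-question difficulty spread; yields SEs of one to two percentage points.
rng = np.random.default_rng(0)
p = np.clip(rng.normal(0.5, 0.3, size=(400, 1)), 0.0, 1.0)
scores = (rng.random((400, 4)) < p).astype(float)
print(tuple(round(100 * v, 2) for v in eval_standard_errors(scores)))
```

Under these formulas the table's edge cases fall out naturally: with count = 1 the within-question variance is undefined, giving the NaN entries, and SE_pred(A) = 0 is what you would see if a model's repeated completions were identical on every question (e.g. greedy decoding), in which case SE_x(A) equals SE(A).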