lsat_cot: by models

SE predicted by accuracy

Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.
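For intuition about why a standard error would depend on accuracy at all, here is a minimal sketch under a simple assumption that this page does not state: each question is an independent Bernoulli trial, so the standard error of a mean accuracy p over N questions is sqrt(p(1-p)/N). The function name `binomial_se` and the value `n_questions=400` are hypothetical and chosen for illustration only; the page does not give the dataset size or the formula behind the plot.

```python
import math


def binomial_se(accuracy: float, n_questions: int) -> float:
    """Standard error of a mean accuracy, assuming independent Bernoulli questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


# n_questions=400 is a hypothetical dataset size, not taken from this page.
for acc in (0.2, 0.5, 0.8):
    se_points = 100 * binomial_se(acc, n_questions=400)
    print(f"accuracy={acc:.1f}  SE={se_points:.2f} percentage points")
```

Under this assumption the standard error peaks at 50% accuracy and shrinks symmetrically toward 0% and 100%.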

CDF of question level accuracy

Results table by model

model	pass1	pass@count	win_rate	count	SE(A)	SE_x(A)	SE_pred(A)
qwen3-32b	80.1	94.5	33.7	10	2	1.5	1.3
qwen3-14b	77	92.3	31.6	11	2.1	1.7	1.2
llama-3.1-70B-instruct	75.2	75.2	30.2	13	2.2	2.2	0
deepseek_r1_distill_llama_70b	74.9	89.6	29.9	9	2.2	1.8	1.2
qwen2-72b-instruct	73.6	81.1	29.4	8	2.2	2	0.82
qwen3-8b	72.2	92.6	28.9	11	2.2	1.7	1.4
deepseek_r1_distill_qwen_32b	71.3	88.3	28	9	2.3	1.8	1.4
qwen2.5-coder-32b-instruct	70.4	90.6	27.3	10	2.3	1.8	1.4
qwen3-4b	70	89.8	27.9	12	2.3	1.8	1.4
qwen1.5-72b-chat	67.5	67.5	25.2	1	2.3	NaN	NaN
google_gemma_3_12b_it	67	83.9	25.5	10	2.3	2	1.3
mistralai_mixtral_8x22b_instruct_v0.1	66.8	84.1	25	11	2.3	2	1.2
deepseek_r1_distill_qwen_14b	66.7	87.6	25.5	11	2.3	1.8	1.5
qwen2-math-72b-instruct	66.7	83.1	25.1	7	2.3	1.9	1.3
qwen1.5-32b-chat	66.5	66.5	24.5	1	2.4	NaN	NaN
qwen2.5-coder-14b-instruct	63.2	86.4	22.8	11	2.4	2	1.3
qwen2-7b-instruct	58.3	77.2	20.6	12	2.5	2.1	1.2
qwen1.5-14b-chat	57.5	73.2	20	11	2.5	2.2	1.1
mistralai_mixtral_8x7b_instruct_v0.1	57.4	72	20	11	2.5	2.3	0.99
qwen2.5-coder-7b-instruct	56.2	80.4	19.7	12	2.5	2.1	1.4
llama-3.1-8B-instruct	53.6	53.6	18.2	15	2.5	2.5	0
qwen3-1.7b	51.2	79.2	18.2	12	2.5	1.9	1.6
mistralai_ministral_8b_instruct_2410	51	87.3	17.4	12	2.5	1.8	1.7
qwen1.5-7b-chat	50.5	70.7	16.8	12	2.5	2.2	1.2
mistralai_mathstral_7b_v0.1	49.4	80.9	16.7	11	2.5	1.9	1.6
google_gemma_3_4b_it	48.6	76.7	17	13	2.5	2	1.5
mistralai_mistral_7b_instruct_v0.3	48	64.8	15.2	11	2.5	2.2	1.2
mistralai_mistral_7b_instruct_v0.2	47.6	53.8	16.1	11	2.5	2.4	0.72
deepseek_v2_lite_chat	47.5	76.4	16.4	11	2.5	2	1.5
llama-3.2-3B-instruct	46.9	46.9	16	15	2.5	2.5	0
deepseek_r1_distill_llama_8b	45.1	84.4	15.4	12	2.5	1.6	1.9
deepseek_r1_distill_qwen_7b	44.1	78.4	15.3	11	2.5	1.7	1.8
qwen2-math-7b-instruct	43.8	76.4	15.3	12	2.5	1.9	1.5
qwen2.5-coder-3b-instruct	40.2	89.3	14	12	2.4	1.5	1.9
mistralai_mistral_7b_instruct_v0.1	37.4	83.6	12.7	11	2.4	1.5	1.8
qwen3-0.6b	37	64.5	13.7	13	2.4	2	1.4
qwen2-1.5b-instruct	34.6	75.4	11.7	12	2.4	1.7	1.7
qwen1.5-1.8b-chat	27.6	73.4	10.3	12	2.2	1.4	1.7
qwen2.5-coder-1.5b-instruct	27.4	88.8	10.3	12	2.2	1.1	2
llama-3.2-1B-instruct	26.3	26.3	9.65	11	2.2	2.2	0
qwen2-math-1.5b-instruct	24.3	77.2	10.5	11	2.1	1.1	1.8
deepseek_r1_distill_qwen_1.5b	24.2	74.7	9.21	12	2.1	0.98	1.9
qwen2-0.5b-instruct	22.7	86.1	10	13	2.1	0.84	1.9
qwen1.5-0.5b-chat	21.7	86.1	9.62	13	2.1	0.7	1.9
qwen2.5-coder-0.5b-instruct	20.5	91.3	9.33	13	2	0.33	2
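
The page does not define the three SE columns. One reading that appears consistent with the numbers is that SE(A) combines SE_x(A) and SE_pred(A) in quadrature, i.e. SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2. This is an assumption, not something the page states. The sketch below checks it on four rows copied from the table; the helper name `combined_se` is hypothetical.

```python
import math

# (model, SE(A), SE_x(A), SE_pred(A)) copied from the results table.
rows = [
    ("qwen3-32b", 2.0, 1.5, 1.3),
    ("llama-3.1-70B-instruct", 2.2, 2.2, 0.0),
    ("deepseek_r1_distill_llama_8b", 2.5, 1.6, 1.9),
    ("qwen2.5-coder-0.5b-instruct", 2.0, 0.33, 2.0),
]


def combined_se(se_x: float, se_pred: float) -> float:
    """Combine two components in quadrature (assumed relationship, see lead-in)."""
    return math.sqrt(se_x**2 + se_pred**2)


for model, se_total, se_x, se_pred in rows:
    estimate = combined_se(se_x, se_pred)
    print(f"{model}: reported SE(A)={se_total}, quadrature sum={estimate:.2f}")
```

For these rows the quadrature sum agrees with the reported SE(A) to within the rounding of the table. That fits the assumed decomposition but does not confirm it.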