lsat_cot: results by model

Figure: SE predicted by accuracy. The typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
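
A binomial model predicts the standard error of an accuracy estimate directly from the accuracy itself, which is presumably what the curve shows. The sketch below is a minimal illustration under that assumption; the question count `n_questions = 400` is inferred from the table values on this page, not stated anywhere, so treat it as hypothetical.

```python
import math

def binomial_se(accuracy_pct: float, n_questions: int) -> float:
    """Standard error of an accuracy estimate under a binomial model,
    returned in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

# n_questions = 400 is an assumption chosen to match the table below:
# at 50% accuracy it gives 100 * sqrt(0.25 / 400) = 2.5, and the SE(A)
# column peaks at 2.5 for models scoring near 50%.
print(round(binomial_se(51.9, 400), 1))  # 2.5, cf. llama-3.1-8B-instruct
print(round(binomial_se(78.5, 400), 1))  # 2.1, cf. qwen3-32b's SE(A) of 2
```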

Figure: CDF of question-level accuracy.
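
A curve like this can be reproduced from per-question results. A minimal sketch, assuming outcomes are available as a question-by-run boolean matrix (the array shape and names here are hypothetical, not this page's data format):

```python
import numpy as np

def question_accuracy_cdf(correct: np.ndarray):
    """correct: (n_questions, n_runs) boolean array of per-run outcomes.
    Returns per-question accuracies sorted ascending and their CDF values."""
    per_question = correct.mean(axis=1)        # each question's accuracy across runs
    xs = np.sort(per_question)
    cdf = np.arange(1, xs.size + 1) / xs.size  # fraction of questions <= each value
    return xs, cdf

# toy usage with random stand-in data
rng = np.random.default_rng(0)
xs, cdf = question_accuracy_cdf(rng.random((400, 4)) < 0.6)
```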

Results table by model

| model | pass@1 (%) | win_rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|:------|-----------:|-------------:|------:|------:|--------:|-----------:|
| qwen3-32b | 78.5 | 35.4 | 2 | 2 | 1.5 | 1.3 |
| qwen3-14b | 75.2 | 32.8 | 3 | 2.2 | 1.8 | 1.2 |
| qwen2-72b-instruct | 74.7 | 33 | 1 | 2.2 | NaN | NaN |
| llama-3.1-70B-instruct | 73.9 | 32.6 | 4 | 2.2 | 2.2 | 0 |
| qwen3-8b | 72.1 | 31 | 3 | 2.2 | 1.7 | 1.4 |
| deepseek_r1_distill_llama_70b | 72 | 30.5 | 2 | 2.2 | 2 | 1.1 |
| qwen3-4b | 69.9 | 29.9 | 4 | 2.3 | 1.8 | 1.3 |
| deepseek_r1_distill_qwen_32b | 68.7 | 28.8 | 2 | 2.3 | 1.9 | 1.3 |
| qwen2.5-coder-32b-instruct | 68.6 | 28.8 | 2 | 2.3 | 1.9 | 1.4 |
| google_gemma_3_12b_it | 68.2 | 28.5 | 2 | 2.3 | 1.9 | 1.3 |
| deepseek_r1_distill_qwen_14b | 64.1 | 25.9 | 4 | 2.4 | 1.8 | 1.5 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 63.8 | 25.7 | 2 | 2.4 | 2 | 1.4 |
| qwen2.5-coder-14b-instruct | 61 | 24.3 | 3 | 2.4 | 1.9 | 1.5 |
| qwen1.5-14b-chat | 57.6 | 22.4 | 3 | 2.5 | 2.1 | 1.2 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 57.5 | 22.1 | 3 | 2.5 | 2.2 | 1 |
| qwen2-7b-instruct | 56.5 | 21.9 | 4 | 2.5 | 2 | 1.4 |
| qwen2.5-coder-7b-instruct | 53.4 | 20.3 | 4 | 2.5 | 2 | 1.5 |
| llama-3.1-8B-instruct | 51.9 | 20 | 7 | 2.5 | 2.5 | 0 |
| qwen3-1.7b | 50.7 | 20.1 | 4 | 2.5 | 1.8 | 1.7 |
| qwen1.5-7b-chat | 49.1 | 18.3 | 3 | 2.5 | 2.1 | 1.4 |
| mistralai_mistral_7b_instruct_v0.3 | 48.9 | 17.8 | 4 | 2.5 | 2.1 | 1.3 |
| google_gemma_3_4b_it | 47.3 | 18.1 | 5 | 2.5 | 2 | 1.5 |
| mistralai_mistral_7b_instruct_v0.2 | 47 | 17.5 | 4 | 2.5 | 2.3 | 0.92 |
| mistralai_mathstral_7b_v0.1 | 46.5 | 18.1 | 4 | 2.5 | 1.7 | 1.8 |
| mistralai_ministral_8b_instruct_2410 | 46 | 17.5 | 4 | 2.5 | 1.5 | 2 |
| deepseek_v2_lite_chat | 44.5 | 17.1 | 2 | 2.5 | 1.8 | 1.7 |
| deepseek_r1_distill_llama_8b | 43.9 | 16.3 | 4 | 2.5 | 1.5 | 1.9 |
| llama-3.2-3B-instruct | 43.9 | 16.7 | 8 | 2.5 | 2.5 | 0 |
| deepseek_r1_distill_qwen_7b | 41.4 | 15.7 | 4 | 2.5 | 1.5 | 1.9 |
| qwen2-math-7b-instruct | 40.5 | 15.9 | 3 | 2.4 | 1.7 | 1.8 |
| qwen2.5-coder-3b-instruct | 35.7 | 13.7 | 4 | 2.4 | 1.3 | 2 |
| qwen3-0.6b | 35 | 13.7 | 5 | 2.4 | 1.8 | 1.5 |
| mistralai_mistral_7b_instruct_v0.1 | 34 | 13 | 4 | 2.4 | 1.3 | 1.9 |
| qwen2-1.5b-instruct | 32.3 | 12.3 | 4 | 2.3 | 1.4 | 1.9 |
| llama-3.2-1B-instruct | 27 | 11.3 | 11 | 2.2 | 2.2 | 0 |
| qwen2.5-coder-1.5b-instruct | 25.8 | 11.3 | 4 | 2.2 | 0.85 | 2 |
| qwen1.5-1.8b-chat | 25.8 | 10.6 | 3 | 2.2 | 1.1 | 1.9 |
| qwen2-math-1.5b-instruct | 22 | 9.81 | 3 | 2.1 | 0.98 | 1.8 |
| qwen2.5-coder-0.5b-instruct | 20.3 | 9.97 | 5 | 2 | 0.077 | 2 |
| deepseek_r1_distill_qwen_1.5b | 20.3 | 8.72 | 4 | 2 | 0.79 | 1.8 |
| qwen2-0.5b-instruct | 19.8 | 9.09 | 5 | 2 | 0.6 | 1.9 |
| qwen1.5-0.5b-chat | 19.2 | 9.3 | 5 | 2 | 0.42 | 1.9 |
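
The three SE columns appear to satisfy a sum-of-squares relation, SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2, consistent with the total standard error splitting into two independent components (that reading is an assumption; only the arithmetic below is taken from the table, and the NaN entries occur for the single-count row, where no split is reported). A quick check on two rows:

```python
import math

# (SE(A), SE_x(A), SE_pred(A)) values copied from the table above
rows = {
    "qwen3-8b": (2.2, 1.7, 1.4),
    "mistralai_mathstral_7b_v0.1": (2.5, 1.7, 1.8),
}
for name, (se, se_x, se_pred) in rows.items():
    recombined = math.sqrt(se_x**2 + se_pred**2)
    print(f"{name}: sqrt({se_x}^2 + {se_pred}^2) = {recombined:.2f} vs SE(A) = {se}")
# qwen3-8b: 2.20 vs 2.2; mathstral: 2.48 vs 2.5 (differences consistent with rounding)
```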