lsat_cot: results by model



[Figure: SE predicted by accuracy. The typical standard error between pairs of models on this dataset, as a function of absolute accuracy.]
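
As a sanity check on the SE(A) column in the table below, the overall standard error tracks the binomial formula sqrt(p(1-p)/n) almost exactly. A minimal sketch, assuming the dataset has about n = 400 questions (a value back-solved from the table, not stated on this page):

```python
import math

def binomial_se(accuracy_pct: float, n_questions: int = 400) -> float:
    """Binomial standard error of an accuracy estimate, in percentage points.

    n_questions = 400 is an assumption back-solved from the table below,
    not a number stated on this page.
    """
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

print(round(binomial_se(80.1), 1))  # ~2.0, matching the qwen3-32b row
print(round(binomial_se(50.5), 1))  # ~2.5, matching the qwen1.5-7b-chat row
```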

[Figure: CDF of question-level accuracy.]
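
The CDF figure summarizes how per-question pass rates are distributed across the dataset. A minimal sketch of how such a curve is built, using hypothetical placeholder data (the per-question accuracies behind this figure are not reproduced on this page):

```python
import numpy as np

# Hypothetical per-question accuracies (fraction of samples on which each
# question was answered correctly); placeholder data, not from this page.
per_question_acc = np.random.default_rng(0).beta(2, 2, size=400)

# Empirical CDF: after sorting, the CDF at the i-th smallest value is i/n.
xs = np.sort(per_question_acc)
ys = np.arange(1, len(xs) + 1) / len(xs)

# (xs, ys) trace the curve; e.g. plt.step(xs, ys, where="post") to plot.
```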

Results table by model

All accuracy, win-rate, and SE columns are in percentage points. SE(A) is the overall standard error of the pass@1 accuracy; it decomposes in quadrature into a question-level component SE_x(A) and a prediction-level component SE_pred(A), so SE(A)² ≈ SE_x(A)² + SE_pred(A)². count appears to be the number of samples drawn per question; rows with count = 1 cannot estimate the two components separately, hence the NaN entries.

| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| qwen3-32b | 80.1 | 33.7 | 10 | 2 | 1.5 | 1.3 |
| qwen3-14b | 77 | 31.6 | 11 | 2.1 | 1.7 | 1.2 |
| llama-3.1-70B-instruct | 75.2 | 30.2 | 13 | 2.2 | 2.2 | 0 |
| deepseek_r1_distill_llama_70b | 74.9 | 29.9 | 9 | 2.2 | 1.8 | 1.2 |
| qwen2-72b-instruct | 73.6 | 29.4 | 8 | 2.2 | 2 | 0.82 |
| qwen3-8b | 72.2 | 28.9 | 11 | 2.2 | 1.7 | 1.4 |
| deepseek_r1_distill_qwen_32b | 71.3 | 28 | 9 | 2.3 | 1.8 | 1.4 |
| qwen2.5-coder-32b-instruct | 70.4 | 27.3 | 10 | 2.3 | 1.8 | 1.4 |
| qwen3-4b | 70 | 27.9 | 12 | 2.3 | 1.8 | 1.4 |
| qwen1.5-72b-chat | 67.5 | 25.2 | 1 | 2.3 | NaN | NaN |
| google_gemma_3_12b_it | 67 | 25.5 | 10 | 2.3 | 2 | 1.3 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 66.8 | 25 | 11 | 2.3 | 2 | 1.2 |
| deepseek_r1_distill_qwen_14b | 66.7 | 25.5 | 11 | 2.3 | 1.8 | 1.5 |
| qwen2-math-72b-instruct | 66.7 | 25.1 | 7 | 2.3 | 1.9 | 1.3 |
| qwen1.5-32b-chat | 66.5 | 24.5 | 1 | 2.4 | NaN | NaN |
| qwen2.5-coder-14b-instruct | 63.2 | 22.8 | 11 | 2.4 | 2 | 1.3 |
| qwen2-7b-instruct | 58.3 | 20.6 | 12 | 2.5 | 2.1 | 1.2 |
| qwen1.5-14b-chat | 57.5 | 20 | 11 | 2.5 | 2.2 | 1.1 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 57.4 | 20 | 11 | 2.5 | 2.3 | 0.99 |
| qwen2.5-coder-7b-instruct | 56.2 | 19.7 | 12 | 2.5 | 2.1 | 1.4 |
| llama-3.1-8B-instruct | 53.6 | 18.2 | 15 | 2.5 | 2.5 | 0 |
| qwen3-1.7b | 51.2 | 18.2 | 12 | 2.5 | 1.9 | 1.6 |
| mistralai_ministral_8b_instruct_2410 | 51 | 17.4 | 12 | 2.5 | 1.8 | 1.7 |
| qwen1.5-7b-chat | 50.5 | 16.8 | 12 | 2.5 | 2.2 | 1.2 |
| mistralai_mathstral_7b_v0.1 | 49.4 | 16.7 | 11 | 2.5 | 1.9 | 1.6 |
| google_gemma_3_4b_it | 48.6 | 17 | 13 | 2.5 | 2 | 1.5 |
| mistralai_mistral_7b_instruct_v0.3 | 48 | 15.2 | 11 | 2.5 | 2.2 | 1.2 |
| mistralai_mistral_7b_instruct_v0.2 | 47.6 | 16.1 | 11 | 2.5 | 2.4 | 0.72 |
| deepseek_v2_lite_chat | 47.5 | 16.4 | 11 | 2.5 | 2 | 1.5 |
| llama-3.2-3B-instruct | 46.9 | 16 | 15 | 2.5 | 2.5 | 0 |
| deepseek_r1_distill_llama_8b | 45.1 | 15.4 | 12 | 2.5 | 1.6 | 1.9 |
| deepseek_r1_distill_qwen_7b | 44.1 | 15.3 | 11 | 2.5 | 1.7 | 1.8 |
| qwen2-math-7b-instruct | 43.8 | 15.3 | 12 | 2.5 | 1.9 | 1.5 |
| qwen2.5-coder-3b-instruct | 40.2 | 14 | 12 | 2.4 | 1.5 | 1.9 |
| mistralai_mistral_7b_instruct_v0.1 | 37.4 | 12.7 | 11 | 2.4 | 1.5 | 1.8 |
| qwen3-0.6b | 37 | 13.7 | 13 | 2.4 | 2 | 1.4 |
| qwen2-1.5b-instruct | 34.6 | 11.7 | 12 | 2.4 | 1.7 | 1.7 |
| qwen1.5-1.8b-chat | 27.6 | 10.3 | 12 | 2.2 | 1.4 | 1.7 |
| qwen2.5-coder-1.5b-instruct | 27.4 | 10.3 | 12 | 2.2 | 1.1 | 2 |
| llama-3.2-1B-instruct | 26.3 | 9.65 | 11 | 2.2 | 2.2 | 0 |
| qwen2-math-1.5b-instruct | 24.3 | 10.5 | 11 | 2.1 | 1.1 | 1.8 |
| deepseek_r1_distill_qwen_1.5b | 24.2 | 9.21 | 12 | 2.1 | 0.98 | 1.9 |
| qwen2-0.5b-instruct | 22.7 | 10 | 13 | 2.1 | 0.84 | 1.9 |
| qwen1.5-0.5b-chat | 21.7 | 9.62 | 13 | 2.1 | 0.7 | 1.9 |
| qwen2.5-coder-0.5b-instruct | 20.5 | 9.33 | 13 | 2 | 0.33 | 2 |
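
The SE decomposition can be verified directly from the table. A small consistency check over a few rows copied from above (illustrative only, not code from the paper):

```python
import math

# (model, SE(A), SE_x(A), SE_pred(A)) for a few rows copied from the table.
rows = [
    ("qwen3-32b", 2.0, 1.5, 1.3),
    ("mistralai_mixtral_8x7b_instruct_v0.1", 2.5, 2.3, 0.99),
    ("qwen2.5-coder-0.5b-instruct", 2.0, 0.33, 2.0),
]

for model, se, se_x, se_pred in rows:
    # SE(A) should match the quadrature sum of its two components,
    # up to the rounding applied in the table.
    recombined = math.hypot(se_x, se_pred)
    print(f"{model}: SE(A)={se}, sqrt(SE_x^2 + SE_pred^2)={recombined:.2f}")
```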