lsat_cot: results by model



Figure: SE predicted by accuracy. The typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
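The simplest predictor of this kind is the binomial standard error, which depends only on accuracy and the number of questions. Below is a minimal sketch of that formula; the question count `n_questions` is an illustrative assumption (the page does not state it), and the paired-model curve in the figure may additionally account for correlation between the two models' errors.

```python
import numpy as np

def predicted_se(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy estimate.

    `accuracy` is a fraction in [0, 1]; multiply the result by 100
    to get percentage points, the unit used in the table below.
    """
    return np.sqrt(accuracy * (1.0 - accuracy) / n_questions)

# Illustrative only: at 50% accuracy on a hypothetical 400-question
# eval, SE = sqrt(0.25 / 400) = 0.025, i.e. 2.5 percentage points,
# the same scale as the SE(A) column below, which peaks near 50%.
print(predicted_se(0.50, 400))  # 0.025
```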

Figure: CDF of question-level accuracy.
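This curve is built from per-question accuracies, presumably averaging each question's 0/1 scores over its repeated samples (or over models; the page does not say). A minimal sketch, with a hypothetical random score matrix standing in for the real data:

```python
import numpy as np

# Hypothetical 0/1 score matrix: one row per question, one column per
# repeated sample of that question (a stand-in for real eval data).
rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=(400, 4))

q_acc = scores.mean(axis=1)                # per-question accuracy
xs = np.sort(q_acc)                        # CDF support
cdf = np.arange(1, len(xs) + 1) / len(xs)  # empirical CDF values
# Plotting xs against cdf reproduces a curve like the figure above.
```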

Results table by model

| model | pass@1 | pass@count | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|---|
| qwen3-32b | 78.5 | 85.9 | 35.4 | 2 | 2 | 1.5 | 1.3 |
| qwen3-14b | 75.2 | 83.6 | 32.8 | 3 | 2.2 | 1.8 | 1.2 |
| qwen2-72b-instruct | 74.7 | 74.7 | 33 | 1 | 2.2 | NaN | NaN |
| llama-3.1-70B-instruct | 73.9 | 73.9 | 32.6 | 4 | 2.2 | 2.2 | 0 |
| qwen3-8b | 72.1 | 84.9 | 31 | 3 | 2.2 | 1.7 | 1.4 |
| deepseek_r1_distill_llama_70b | 72 | 76.4 | 30.5 | 2 | 2.2 | 2 | 1.1 |
| qwen3-4b | 69.9 | 82.6 | 29.9 | 4 | 2.3 | 1.8 | 1.3 |
| deepseek_r1_distill_qwen_32b | 68.7 | 75.9 | 28.8 | 2 | 2.3 | 1.9 | 1.3 |
| qwen2.5-coder-32b-instruct | 68.6 | 76.2 | 28.8 | 2 | 2.3 | 1.9 | 1.4 |
| google_gemma_3_12b_it | 68.2 | 74.9 | 28.5 | 2 | 2.3 | 1.9 | 1.3 |
| deepseek_r1_distill_qwen_14b | 64.1 | 78.9 | 25.9 | 4 | 2.4 | 1.8 | 1.5 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 63.8 | 71.2 | 25.7 | 2 | 2.4 | 2 | 1.4 |
| qwen2.5-coder-14b-instruct | 61 | 74.9 | 24.3 | 3 | 2.4 | 1.9 | 1.5 |
| qwen1.5-14b-chat | 57.6 | 67.5 | 22.4 | 3 | 2.5 | 2.1 | 1.2 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 57.5 | 65 | 22.1 | 3 | 2.5 | 2.2 | 1 |
| qwen2-7b-instruct | 56.5 | 72 | 21.9 | 4 | 2.5 | 2 | 1.4 |
| qwen2.5-coder-7b-instruct | 53.4 | 70.5 | 20.3 | 4 | 2.5 | 2 | 1.5 |
| llama-3.1-8B-instruct | 51.9 | 51.9 | 20 | 7 | 2.5 | 2.5 | 0 |
| qwen3-1.7b | 50.7 | 73.4 | 20.1 | 4 | 2.5 | 1.8 | 1.7 |
| qwen1.5-7b-chat | 49.1 | 61.8 | 18.3 | 3 | 2.5 | 2.1 | 1.4 |
| mistralai_mistral_7b_instruct_v0.3 | 48.9 | 63.5 | 17.8 | 4 | 2.5 | 2.1 | 1.3 |
| google_gemma_3_4b_it | 47.3 | 67.7 | 18.1 | 5 | 2.5 | 2 | 1.5 |
| mistralai_mistral_7b_instruct_v0.2 | 47 | 53.1 | 17.5 | 4 | 2.5 | 2.3 | 0.92 |
| mistralai_mathstral_7b_v0.1 | 46.5 | 71.5 | 18.1 | 4 | 2.5 | 1.7 | 1.8 |
| mistralai_ministral_8b_instruct_2410 | 46 | 74.9 | 17.5 | 4 | 2.5 | 1.5 | 2 |
| deepseek_v2_lite_chat | 44.5 | 56.1 | 17.1 | 2 | 2.5 | 1.8 | 1.7 |
| deepseek_r1_distill_llama_8b | 43.9 | 73.7 | 16.3 | 4 | 2.5 | 1.5 | 1.9 |
| llama-3.2-3B-instruct | 43.9 | 43.9 | 16.7 | 8 | 2.5 | 2.5 | 0 |
| deepseek_r1_distill_qwen_7b | 41.4 | 67.7 | 15.7 | 4 | 2.5 | 1.5 | 1.9 |
| qwen2-math-7b-instruct | 40.5 | 60.8 | 15.9 | 3 | 2.4 | 1.7 | 1.8 |
| qwen2.5-coder-3b-instruct | 35.7 | 69 | 13.7 | 4 | 2.4 | 1.3 | 2 |
| qwen3-0.6b | 35 | 56.6 | 13.7 | 5 | 2.4 | 1.8 | 1.5 |
| mistralai_mistral_7b_instruct_v0.1 | 34 | 66.3 | 13 | 4 | 2.4 | 1.3 | 1.9 |
| qwen2-1.5b-instruct | 32.3 | 62.5 | 12.3 | 4 | 2.3 | 1.4 | 1.9 |
| llama-3.2-1B-instruct | 27 | 27 | 11.3 | 11 | 2.2 | 2.2 | 0 |
| qwen2.5-coder-1.5b-instruct | 25.8 | 62.8 | 11.3 | 4 | 2.2 | 0.85 | 2 |
| qwen1.5-1.8b-chat | 25.8 | 48.9 | 10.6 | 3 | 2.2 | 1.1 | 1.9 |
| qwen2-math-1.5b-instruct | 22 | 45.2 | 9.81 | 3 | 2.1 | 0.98 | 1.8 |
| qwen2.5-coder-0.5b-instruct | 20.3 | 67.5 | 9.97 | 5 | 2 | 0.077 | 2 |
| deepseek_r1_distill_qwen_1.5b | 20.3 | 50.6 | 8.72 | 4 | 2 | 0.79 | 1.8 |
| qwen2-0.5b-instruct | 19.8 | 61 | 9.09 | 5 | 2 | 0.6 | 1.9 |
| qwen1.5-0.5b-chat | 19.2 | 61.8 | 9.3 | 5 | 2 | 0.42 | 1.9 |
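Reading the table: values are in percentage points. pass@1 is mean accuracy over all samples, pass@count is presumably the pass@k rate at k = count (the number of samples drawn per question), and the error columns satisfy SE(A)² ≈ SE_x(A)² + SE_pred(A)² (for qwen3-32b, 1.5² + 1.3² = 3.94 ≈ 2²). One decomposition consistent with that relation, and with the NaN entries at count = 1, splits the variance of the mean accuracy into a between-question term (SE_x) and a within-question sampling term (SE_pred). The sketch below is a reconstruction under those assumptions, not a confirmed copy of the paper's estimator:

```python
import numpy as np

def se_decomposition(scores: np.ndarray) -> tuple[float, float, float]:
    """Return (SE(A), SE_x(A), SE_pred(A)) as fractions for a 0/1 score
    matrix of shape (n_questions, k_samples); multiply by 100 for the
    percentage points used in the table."""
    n, k = scores.shape
    p = scores.mean(axis=1)                # per-question accuracy
    se_total = np.sqrt(p.var(ddof=1) / n)  # SE of the overall mean,
                                           # clustered by question
    if k < 2:
        # The sampling term is not estimable from a single sample per
        # question, matching the NaN row (count = 1) in the table.
        return se_total, float("nan"), float("nan")
    # Within-question sampling noise: the unbiased Bernoulli variance
    # p(1-p) * k/(k-1), averaged over questions and scaled to the mean.
    se_pred = np.sqrt((p * (1 - p)).mean() / (n * (k - 1)))
    # The between-question spread is what remains; by construction
    # SE(A)^2 = SE_x(A)^2 + SE_pred(A)^2, as the table columns show.
    se_x = np.sqrt(max(se_total**2 - se_pred**2, 0.0))
    return se_total, se_x, se_pred
```

On this reading, the llama rows with pass@1 = pass@count are deterministic: every per-question accuracy is 0 or 1, the sampling term vanishes, and SE_x(A) equals SE(A), which is exactly what those rows report.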