lsat_cot: by models

SE predicted by accuracy

The typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
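
As a rough illustration of why the standard error depends on accuracy, a single-sample binomial approximation gives SE(A) ≈ sqrt(A(1 - A) / N). This is only a sketch and is not necessarily how the curve on this page is computed; the function name `binomial_se` and the value `n_questions = 400` are hypothetical, chosen for the example and not stated in the source.

```python
import math


def binomial_se(accuracy_pct: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy, in percentage points.

    accuracy_pct is on a 0-100 scale; n_questions is the number of
    independent questions (an assumed value, not taken from the page).
    """
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


if __name__ == "__main__":
    # The SE peaks at 50% accuracy and shrinks toward 0% and 100%.
    for acc in (20.0, 50.0, 80.0):
        print(acc, round(binomial_se(acc, 400), 2))
```

Under this approximation the error is largest near 50% accuracy and smaller at the extremes, which is the general shape one would expect from a plot of SE against accuracy.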

CDF of question-level accuracy
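
A minimal sketch of how such a CDF could be built, assuming question-level accuracy means the fraction of models that answer a given question correctly (the page does not say whether it averages over models, samples, or both). The names `question_accuracy` and `empirical_cdf` and the tiny 0/1 matrix are hypothetical.

```python
from bisect import bisect_right


def question_accuracy(correct):
    """Per-question accuracy from a models x questions matrix of 0/1 outcomes."""
    n_models = len(correct)
    n_questions = len(correct[0])
    return [sum(row[q] for row in correct) / n_models for q in range(n_questions)]


def empirical_cdf(values):
    """Return (value, fraction of entries <= value) for each distinct value."""
    xs = sorted(values)
    n = len(xs)
    return [(x, bisect_right(xs, x) / n) for x in sorted(set(xs))]


if __name__ == "__main__":
    # Toy data: 3 models (rows) by 4 questions (columns); 1 means correct.
    correct = [
        [1, 1, 0, 0],
        [1, 0, 0, 1],
        [1, 1, 0, 0],
    ]
    for value, fraction in empirical_cdf(question_accuracy(correct)):
        print(round(value, 3), fraction)
```

A CDF that rises steeply near 0 or 1 would indicate many questions that almost every model gets wrong or right; a flatter curve indicates questions that separate the models.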

Results table by model

model pass1 pass@count win_rate count SE(A) SE_x(A) SE_pred(A)
qwen3-32b 80.1 94.5 33.7 10 2 1.5 1.3
qwen3-14b 77 92.3 31.6 11 2.1 1.7 1.2
llama-3.1-70B-instruct 75.2 75.2 30.2 13 2.2 2.2 0
deepseek_r1_distill_llama_70b 74.9 89.6 29.9 9 2.2 1.8 1.2
qwen2-72b-instruct 73.6 81.1 29.4 8 2.2 2 0.82
qwen3-8b 72.2 92.6 28.9 11 2.2 1.7 1.4
deepseek_r1_distill_qwen_32b 71.3 88.3 28 9 2.3 1.8 1.4
qwen2.5-coder-32b-instruct 70.4 90.6 27.3 10 2.3 1.8 1.4
qwen3-4b 70 89.8 27.9 12 2.3 1.8 1.4
qwen1.5-72b-chat 67.5 67.5 25.2 1 2.3 NaN NaN
google_gemma_3_12b_it 67 83.9 25.5 10 2.3 2 1.3
mistralai_mixtral_8x22b_instruct_v0.1 66.8 84.1 25 11 2.3 2 1.2
deepseek_r1_distill_qwen_14b 66.7 87.6 25.5 11 2.3 1.8 1.5
qwen2-math-72b-instruct 66.7 83.1 25.1 7 2.3 1.9 1.3
qwen1.5-32b-chat 66.5 66.5 24.5 1 2.4 NaN NaN
qwen2.5-coder-14b-instruct 63.2 86.4 22.8 11 2.4 2 1.3
qwen2-7b-instruct 58.3 77.2 20.6 12 2.5 2.1 1.2
qwen1.5-14b-chat 57.5 73.2 20 11 2.5 2.2 1.1
mistralai_mixtral_8x7b_instruct_v0.1 57.4 72 20 11 2.5 2.3 0.99
qwen2.5-coder-7b-instruct 56.2 80.4 19.7 12 2.5 2.1 1.4
llama-3.1-8B-instruct 53.6 53.6 18.2 15 2.5 2.5 0
qwen3-1.7b 51.2 79.2 18.2 12 2.5 1.9 1.6
mistralai_ministral_8b_instruct_2410 51 87.3 17.4 12 2.5 1.8 1.7
qwen1.5-7b-chat 50.5 70.7 16.8 12 2.5 2.2 1.2
mistralai_mathstral_7b_v0.1 49.4 80.9 16.7 11 2.5 1.9 1.6
google_gemma_3_4b_it 48.6 76.7 17 13 2.5 2 1.5
mistralai_mistral_7b_instruct_v0.3 48 64.8 15.2 11 2.5 2.2 1.2
mistralai_mistral_7b_instruct_v0.2 47.6 53.8 16.1 11 2.5 2.4 0.72
deepseek_v2_lite_chat 47.5 76.4 16.4 11 2.5 2 1.5
llama-3.2-3B-instruct 46.9 46.9 16 15 2.5 2.5 0
deepseek_r1_distill_llama_8b 45.1 84.4 15.4 12 2.5 1.6 1.9
deepseek_r1_distill_qwen_7b 44.1 78.4 15.3 11 2.5 1.7 1.8
qwen2-math-7b-instruct 43.8 76.4 15.3 12 2.5 1.9 1.5
qwen2.5-coder-3b-instruct 40.2 89.3 14 12 2.4 1.5 1.9
mistralai_mistral_7b_instruct_v0.1 37.4 83.6 12.7 11 2.4 1.5 1.8
qwen3-0.6b 37 64.5 13.7 13 2.4 2 1.4
qwen2-1.5b-instruct 34.6 75.4 11.7 12 2.4 1.7 1.7
qwen1.5-1.8b-chat 27.6 73.4 10.3 12 2.2 1.4 1.7
qwen2.5-coder-1.5b-instruct 27.4 88.8 10.3 12 2.2 1.1 2
llama-3.2-1B-instruct 26.3 26.3 9.65 11 2.2 2.2 0
qwen2-math-1.5b-instruct 24.3 77.2 10.5 11 2.1 1.1 1.8
deepseek_r1_distill_qwen_1.5b 24.2 74.7 9.21 12 2.1 0.98 1.9
qwen2-0.5b-instruct 22.7 86.1 10 13 2.1 0.84 1.9
qwen1.5-0.5b-chat 21.7 86.1 9.62 13 2.1 0.7 1.9
qwen2.5-coder-0.5b-instruct 20.5 91.3 9.33 13 2 0.33 2
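
The three SE columns appear to be related by a variance decomposition, SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2, with the rows where pass1 equals pass@count showing SE_pred(A) = 0. This relation is inferred from the numbers in the table and is not stated on the page; the sketch below simply checks it on a few rows copied from the table. The helper name `combined_se` and the tolerance of 0.1 (to allow for the rounding of the displayed values) are assumptions.

```python
import math

# (model, SE(A), SE_x(A), SE_pred(A)) copied from the results table.
ROWS = [
    ("qwen3-32b", 2.0, 1.5, 1.3),
    ("llama-3.1-70B-instruct", 2.2, 2.2, 0.0),
    ("deepseek_r1_distill_llama_8b", 2.5, 1.6, 1.9),
    ("qwen2.5-coder-0.5b-instruct", 2.0, 0.33, 2.0),
]


def combined_se(se_x: float, se_pred: float) -> float:
    """Combine two components in quadrature (assumed decomposition)."""
    return math.sqrt(se_x ** 2 + se_pred ** 2)


if __name__ == "__main__":
    for model, se, se_x, se_pred in ROWS:
        total = combined_se(se_x, se_pred)
        # Tolerance is loose because the table shows rounded values.
        status = "ok" if abs(total - se) < 0.1 else "mismatch"
        print(f"{model}: reported {se}, recombined {total:.2f} ({status})")
```

If the decomposition holds, a model with small SE_x(A) and large SE_pred(A), such as the smallest models near the bottom of the table, owes most of its total error to the second component.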