leetcode: by models

SE predicted by accuracy

Typical standard error of the difference between pairs of models on this dataset, as a function of absolute accuracy.

CDF of question level accuracy

Results table by model

model pass1 win_rate count SE(A) SE_x(A) SE_pred(A)
google_gemma_3_27b_it 40.1 38.5 12 3.7 3.2 1.8
google_gemma_2_27b_it 20.1 19 10 3 1.8 2.4
llama-3.1-8B-instruct 18.3 17 15 2.9 2.7 0.91
google_gemma_3_4b_it 17.5 16.4 13 2.8 2.5 1.3
google_gemma_2_9b_it 16.8 15.7 11 2.8 2 2
llama-3.2-3B-instruct 9.74 8.93 17 2.2 2.1 0.64
google_gemma_3_12b_it 6.97 6.56 11 1.9 1.5 1.2
google_codegemma_1.1_7b_it 6.28 5.76 13 1.8 1.2 1.4
llama-3.2-1B-instruct 4.03 3.64 12 1.5 1.4 0.46
google_gemma_7b_it 2.5 2.24 12 1.2 0.87 0.77
mistralai_mixtral_8x22b_instruct_v0.1 0.859 0.795 11 0.69 0.084 0.68
google_gemma_3_1b_it 0.694 0.629 12 0.62 0.25 0.57
google_gemma_2b_it 0.641 0.565 13 0.59 0.24 0.54
mistralai_mathstral_7b_v0.1 0.0505 0.0479 11 0.17 0 0.17
qwen3-32b 0.0505 0.048 11 0.17 0 0.17
qwen2.5-coder-14b-instruct 0.0463 0.044 12 0.16 0 0.16
qwen2.5-coder-3b-instruct 0.0463 0.0463 12 0.16 0 0.16
deepseek_r1_distill_llama_70b 0 0 9 0 0 0
deepseek_r1_distill_qwen_1.5b 0 0 12 0 0 0
deepseek_v2_lite_chat 0 0 11 0 0 0
deepseek_r1_distill_qwen_7b 0 0 11 0 0 0
deepseek_r1_distill_qwen_32b 0 0 9 0 0 0
deepseek_r1_distill_qwen_14b 0 0 11 0 0 0
deepseek_r1_distill_llama_8b 0 0 12 0 0 0
mistralai_mistral_7b_instruct_v0.2 0 0 10 0 0 0
mistralai_mistral_7b_instruct_v0.1 0 0 11 0 0 0
mistralai_ministral_8b_instruct_2410 0 0 11 0 0 0
qwen1.5-1.8b-chat 0 0 11 0 0 0
qwen1.5-14b-chat 0 0 12 0 0 0
qwen1.5-32b-chat 0 0 11 0 0 0
qwen1.5-72b-chat 0 0 10 0 0 0
qwen1.5-7b-chat 0 0 12 0 0 0
mistralai_mistral_7b_instruct_v0.3 0 0 11 0 0 0
mistralai_mixtral_8x7b_instruct_v0.1 0 0 12 0 0 0
qwen1.5-0.5b-chat 0 0 13 0 0 0
qwen2-72b-instruct 0 0 10 0 0 0
qwen2-1.5b-instruct 0 0 13 0 0 0
qwen2-0.5b-instruct 0 0 13 0 0 0
qwen2-7b-instruct 0 0 11 0 0 0
qwen2-math-7b-instruct 0 0 6 0 0 0
qwen2.5-coder-0.5b-instruct 0 0 13 0 0 0
qwen2-math-72b-instruct 0 0 10 0 0 0
qwen2-math-1.5b-instruct 0 0 4 0 0 0
qwen2.5-coder-32b-instruct 0 0 10 0 0 0
qwen2.5-coder-1.5b-instruct 0 0 11 0 0 0
qwen3-0.6b 0 0 13 0 0 0
qwen2.5-coder-7b-instruct 0 0 10 0 0 0
qwen3-1.7b 0 0 12 0 0 0
qwen3-14b 0 0 12 0 0 0
qwen3-4b 0 0 12 0 0 0
qwen3-8b 0 0 12 0 0 0
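
The page does not say how the SE(A) column is computed. As a rough illustration only, the sketch below assumes the usual binomial standard error of an accuracy, sqrt(p(1-p)/n), reported in percentage points. The function name `binomial_se` and the question count `N_QUESTIONS` are hypothetical and not taken from this page.

```python
import math


def binomial_se(accuracy_pct: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy, in percentage points.

    Assumes each question is an independent pass/fail trial; this is an
    illustrative assumption, not necessarily how the table was produced.
    """
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


# Hypothetical number of questions; the page does not state the real value.
N_QUESTIONS = 180

for acc in (40.1, 20.1, 0.0):
    se = binomial_se(acc, N_QUESTIONS)
    print(f"pass1={acc:5.1f}%  binomial SE ~ {se:.2f} points")
```

Under this formula an accuracy of exactly 0 yields a standard error of exactly 0, which agrees with the all-zero rows in the table but understates the real uncertainty for models that never pass.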