human_eval: by model



Figure: SE predicted by accuracy (the typical standard error between pairs of models on this dataset, as a function of absolute accuracy).
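
The SE(A) column in the table below appears to match the plain binomial standard error at n = 164 questions (the size of HumanEval), so the curve in this figure can be sketched directly. A minimal sketch under that assumption:

```python
import math

N_QUESTIONS = 164  # HumanEval has 164 problems

def binomial_se(accuracy_pct: float, n: int = N_QUESTIONS) -> float:
    """Standard error of mean accuracy over n independent questions,
    returned in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

print(round(binomial_se(87.0), 1))   # 2.6, matching SE(A) for the top row
print(round(binomial_se(51.8), 1))   # 3.9
print(round(binomial_se(1.37), 2))   # 0.91, matching the bottom row
```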

Figure: CDF of question-level accuracy.
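
A curve like this can be reproduced from per-question pass results. A minimal sketch, assuming a 0/1 score matrix of shape (questions x samples); the array below is stand-in random data, not the real scores:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=(164, 4))  # stand-in 0/1 pass results

q_acc = scores.mean(axis=1)                 # per-question accuracy
xs = np.sort(q_acc)
cdf = np.arange(1, len(xs) + 1) / len(xs)   # empirical CDF
for x, c in zip(xs[::40], cdf[::40]):
    print(f"P(accuracy <= {x:.2f}) = {c:.2f}")
```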

Results table by model

pass@1 and win_rate are percentages, and the SE columns are in percentage points. SE_x(A) and SE_pred(A) decompose the total standard error, with SE(A)^2 = SE_x(A)^2 + SE_pred(A)^2 up to rounding.

model pass@1 win_rate count SE(A) SE_x(A) SE_pred(A)
google_gemma_3_27b_it 87 47.1 3 2.6 2.4 1
qwen3-14b 85.4 46.3 3 2.8 2.3 1.5
google_gemma_3_12b_it 82.9 44.1 4 2.9 2.7 1.1
qwen3-32b 80.3 43 3 3.1 1.9 2.5
qwen2.5-coder-32b-instruct 79.3 43.2 2 3.2 1.9 2.5
qwen3-4b 78.8 41.6 4 3.2 2.6 1.8
qwen3-8b 78.7 42.5 3 3.2 2.3 2.2
qwen2.5-coder-14b-instruct 74 39.6 3 3.4 2 2.8
google_gemma_2_27b_it 73.2 37.6 2 3.5 2.9 1.8
mistralai_mixtral_8x22b_instruct_v0.1 71.1 36.1 3 3.5 2.5 2.5
google_gemma_3_4b_it 69.8 34.8 5 3.6 3.3 1.5
google_gemma_2_9b_it 63.3 30.9 4 3.8 3.4 1.7
llama-3.1-8B-instruct 59.8 28 7 3.8 3.8 0
qwen3-1.7b 58.8 28.9 4 3.8 3 2.4
qwen2-math-72b-instruct 57.9 28.8 2 3.9 2.4 3
google_codegemma_1.1_7b_it 54.9 25.5 5 3.9 3 2.5
qwen1.5-14b-chat 51.8 23.6 3 3.9 2.8 2.7
deepseek_r1_distill_qwen_32b 51.5 27.2 2 3.9 1.3 3.7
deepseek_r1_distill_qwen_14b 47.3 24.3 4 3.9 1.7 3.5
qwen2-72b-instruct 47 23.8 2 3.9 1.9 3.4
qwen1.5-32b-chat 46.7 21.7 3 3.9 2.8 2.7
deepseek_r1_distill_llama_70b 45.1 24.2 2 3.9 1.6 3.6
llama-3.2-3B-instruct 42.1 18.2 10 3.9 3.9 0
qwen1.5-72b-chat 41.8 18.3 2 3.9 2.9 2.6
google_gemma_3_1b_it 40.1 17.6 4 3.8 3.4 1.8
qwen2-7b-instruct 39.6 18.6 4 3.8 1.9 3.3
deepseek_v2_lite_chat 39.2 17.9 3 3.8 2 3.2
qwen2.5-coder-3b-instruct 38.6 18 4 3.8 2.1 3.2
qwen2.5-coder-7b-instruct 38.6 17.9 4 3.8 1.9 3.3
mistralai_mistral_7b_instruct_v0.3 37.7 15.3 4 3.8 3 2.3
qwen1.5-7b-chat 36.4 15.2 3 3.8 2.7 2.7
mistralai_ministral_8b_instruct_2410 35.7 15.9 4 3.7 1.9 3.2
mistralai_mixtral_8x7b_instruct_v0.1 35 15.6 3 3.7 2.6 2.7
mistralai_mathstral_7b_v0.1 30.6 13.1 4 3.6 2.1 3
qwen2.5-coder-1.5b-instruct 28.8 12 4 3.5 2.1 2.8
deepseek_r1_distill_llama_8b 27.4 13 4 3.5 1.6 3.1
qwen2-math-7b-instruct 25 11.2 2 3.4 1.6 3
google_gemma_7b_it 24.4 9.48 4 3.4 2.7 2
qwen2.5-coder-0.5b-instruct 22.9 9.36 5 3.3 1.9 2.7
deepseek_r1_distill_qwen_7b 22 9.93 4 3.2 1.4 2.9
llama-3.2-1B-instruct 22 8.34 13 3.2 3.2 0
mistralai_mistral_7b_instruct_v0.1 20.1 7.54 4 3.1 1.7 2.6
qwen3-0.6b 18.8 6.99 5 3 1.9 2.4
google_gemma_2b_it 17.7 5.79 4 3 2.4 1.7
mistralai_mistral_7b_instruct_v0.2 11.7 4.32 4 2.5 1.2 2.2
qwen2-1.5b-instruct 8.38 3.32 4 2.2 0.44 2.1
qwen1.5-1.8b-chat 4.47 1.61 3 1.6 0.93 1.3
qwen2-0.5b-instruct 3.78 0.958 5 1.5 0.81 1.2
qwen2-math-1.5b-instruct 2.24 0.891 3 1.2 0.31 1.1
qwen1.5-0.5b-chat 1.83 0.367 5 1 0.62 0.84
deepseek_r1_distill_qwen_1.5b 1.37 0.484 4 0.91 0 0.91
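
A quick spot-check of that decomposition against a few rows of the table (values copied from above):

```python
# SE(A)^2 should equal SE_x(A)^2 + SE_pred(A)^2 up to rounding.
rows = [
    # (model, SE(A), SE_x(A), SE_pred(A))
    ("google_gemma_3_27b_it",         2.6,  2.4, 1.0),
    ("llama-3.1-8B-instruct",         3.8,  3.8, 0.0),
    ("deepseek_r1_distill_qwen_1.5b", 0.91, 0.0, 0.91),
]
for model, se, se_x, se_pred in rows:
    recombined = (se_x**2 + se_pred**2) ** 0.5
    print(f"{model}: SE(A)={se}, sqrt(SE_x^2 + SE_pred^2)={recombined:.2f}")
```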