human_eval_plus: results by model



[Figure] SE predicted by accuracy — the typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
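The curve in this figure is consistent with the binomial standard error of a mean accuracy over the 164 HumanEval+ problems: plugging the pass@1 values from the table below into sqrt(A(1-A)/N) with N = 164 reproduces the SE(A) column. A minimal sketch (the formula and N are inferred from the data, not stated on this page):

```python
import math

N_QUESTIONS = 164  # number of HumanEval+ problems (assumption)

def se_accuracy(acc: float, n: int = N_QUESTIONS) -> float:
    """Binomial standard error of a mean accuracy over n i.i.d. questions."""
    return math.sqrt(acc * (1.0 - acc) / n)

# The SE peaks at 50% accuracy and shrinks toward 0% and 100%:
for acc in (0.772, 0.5, 0.05):
    print(f"acc={acc:.1%}  SE={se_accuracy(acc):.1%}")
```

For example, qwen3-14b at 77.2% pass@1 gives an SE of about 3.3 percentage points, matching its SE(A) entry.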

[Figure] CDF of question-level accuracy.
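Given per-question pass rates, an empirical CDF like the one shown here can be built by sorting the values and assigning each its cumulative rank; a sketch (`question_acc` is a hypothetical list of per-question accuracies, not data from this page):

```python
def empirical_cdf(values):
    """Return sorted (x, F(x)) pairs, where F(x) is the fraction of values <= x."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

# Hypothetical per-question accuracies averaged over sampled runs:
question_acc = [0.0, 0.25, 0.25, 0.5, 1.0]
for x, f in empirical_cdf(question_acc):
    print(x, f)
```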

Results table by model

| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| qwen3-14b | 77.2 | 44.4 | 3 | 3.3 | 2.8 | 1.7 |
| google_gemma_3_27b_it | 75 | 41.8 | 3 | 3.4 | 3.2 | 1.2 |
| qwen2.5-coder-32b-instruct | 74.1 | 42.6 | 2 | 3.4 | 2.5 | 2.3 |
| qwen3-32b | 73.4 | 41.7 | 3 | 3.5 | 2.6 | 2.3 |
| google_gemma_3_12b_it | 72.9 | 40.2 | 4 | 3.5 | 3.3 | 1.2 |
| qwen3-4b | 69.4 | 38.1 | 4 | 3.6 | 3.1 | 1.8 |
| qwen3-8b | 68.9 | 38.7 | 3 | 3.6 | 2.8 | 2.3 |
| qwen2.5-coder-14b-instruct | 68.1 | 38.7 | 3 | 3.6 | 2.3 | 2.9 |
| google_gemma_2_27b_it | 66.8 | 35.9 | 2 | 3.7 | 3.2 | 1.8 |
| google_gemma_3_4b_it | 60.9 | 32.1 | 5 | 3.8 | 3.5 | 1.5 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 60 | 31.4 | 3 | 3.8 | 2.7 | 2.7 |
| google_gemma_2_9b_it | 56.1 | 29 | 4 | 3.9 | 3.5 | 1.7 |
| llama-3.1-8B-instruct | 50.6 | 24.8 | 7 | 3.9 | 3.9 | 0 |
| qwen3-1.7b | 50.5 | 25.6 | 4 | 3.9 | 3.2 | 2.2 |
| deepseek_r1_distill_qwen_32b | 47 | 25.4 | 2 | 3.9 | 1.6 | 3.6 |
| qwen2-math-72b-instruct | 45.7 | 24.4 | 2 | 3.9 | 2.3 | 3.1 |
| google_codegemma_1.1_7b_it | 44.9 | 21.4 | 5 | 3.9 | 3.1 | 2.4 |
| qwen2-72b-instruct | 43.3 | 23.2 | 2 | 3.9 | 1.8 | 3.4 |
| deepseek_r1_distill_qwen_14b | 43.1 | 23.8 | 4 | 3.9 | 1.8 | 3.4 |
| deepseek_r1_distill_llama_70b | 43 | 24.1 | 2 | 3.9 | 2 | 3.3 |
| qwen1.5-14b-chat | 42.3 | 19.9 | 3 | 3.9 | 2.9 | 2.5 |
| qwen1.5-32b-chat | 41.5 | 20.7 | 3 | 3.8 | 2.8 | 2.6 |
| deepseek_v2_lite_chat | 36.2 | 17 | 3 | 3.8 | 2.6 | 2.7 |
| qwen2-7b-instruct | 35.5 | 17.9 | 4 | 3.7 | 1.9 | 3.2 |
| google_gemma_3_1b_it | 35.5 | 16.4 | 4 | 3.7 | 3.5 | 1.3 |
| qwen2.5-coder-7b-instruct | 35.4 | 17.5 | 4 | 3.7 | 2.1 | 3.1 |
| qwen1.5-72b-chat | 34.1 | 16.1 | 2 | 3.7 | 2.5 | 2.7 |
| qwen2.5-coder-3b-instruct | 33.7 | 17.4 | 4 | 3.7 | 1.8 | 3.2 |
| llama-3.2-3B-instruct | 33.5 | 15.1 | 10 | 3.7 | 3.7 | 0 |
| mistralai_ministral_8b_instruct_2410 | 32 | 15.7 | 4 | 3.6 | 1.9 | 3.1 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 30.3 | 14.2 | 3 | 3.6 | 2.7 | 2.4 |
| mistralai_mistral_7b_instruct_v0.3 | 30.2 | 13.2 | 4 | 3.6 | 2.6 | 2.5 |
| qwen1.5-7b-chat | 29.7 | 13.3 | 3 | 3.6 | 2.5 | 2.5 |
| deepseek_r1_distill_llama_8b | 24.4 | 12.2 | 4 | 3.4 | 1.6 | 2.9 |
| mistralai_mathstral_7b_v0.1 | 23.8 | 10.4 | 4 | 3.3 | 1.8 | 2.8 |
| qwen2-math-7b-instruct | 23.5 | 10.9 | 2 | 3.3 | 2.1 | 2.6 |
| qwen2.5-coder-1.5b-instruct | 22.4 | 10.3 | 4 | 3.3 | 1.6 | 2.8 |
| deepseek_r1_distill_qwen_7b | 20.7 | 10.2 | 4 | 3.2 | 1.5 | 2.8 |
| google_gemma_7b_it | 20.4 | 8.83 | 4 | 3.1 | 2.5 | 1.9 |
| llama-3.2-1B-instruct | 20.1 | 8.17 | 13 | 3.1 | 3.1 | 0 |
| qwen2.5-coder-0.5b-instruct | 18.4 | 8.18 | 5 | 3 | 1.4 | 2.7 |
| qwen3-0.6b | 17.1 | 7.15 | 5 | 2.9 | 1.9 | 2.3 |
| mistralai_mistral_7b_instruct_v0.1 | 13.6 | 5.29 | 4 | 2.7 | 1.5 | 2.2 |
| google_gemma_2b_it | 13.3 | 4.63 | 4 | 2.6 | 2.2 | 1.5 |
| mistralai_mistral_7b_instruct_v0.2 | 7.93 | 2.95 | 4 | 2.1 | 1.1 | 1.8 |
| qwen2-1.5b-instruct | 5.79 | 2.42 | 4 | 1.8 | 0.54 | 1.7 |
| qwen1.5-1.8b-chat | 3.66 | 1.27 | 3 | 1.5 | 0.73 | 1.3 |
| qwen2-0.5b-instruct | 2.68 | 0.851 | 5 | 1.3 | 0.57 | 1.1 |
| qwen2-math-1.5b-instruct | 1.63 | 0.621 | 3 | 0.99 | 0.69 | 0.7 |
| qwen1.5-0.5b-chat | 1.22 | 0.356 | 5 | 0.86 | 0.42 | 0.75 |
| deepseek_r1_distill_qwen_1.5b | 0.457 | 0.162 | 4 | 0.53 | 0 | 0.53 |
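The three error columns appear to satisfy SE(A)² ≈ SE_x(A)² + SE_pred(A)², i.e. the total standard error combines a question-sampling component and a prediction component in quadrature. A quick check on a few rows (this is an observation from the table values, not a formula documented on this page):

```python
import math

# (SE(A), SE_x(A), SE_pred(A)) triples copied from the table above
rows = {
    "qwen3-14b": (3.3, 2.8, 1.7),
    "deepseek_r1_distill_qwen_32b": (3.9, 1.6, 3.6),
    "llama-3.2-3B-instruct": (3.7, 3.7, 0.0),
}
for name, (se, se_x, se_pred) in rows.items():
    combined = math.hypot(se_x, se_pred)  # sqrt(se_x**2 + se_pred**2)
    print(f"{name}: SE(A)={se}  sqrt(SE_x^2 + SE_pred^2)={combined:.1f}")
```

Each row's combined value agrees with SE(A) to the one-decimal precision reported here.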