human_eval: by models



Figure: SE predicted by accuracy. The typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
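HumanEval has 164 problems, and the SE(A) column in the table below is consistent, up to rounding, with the plain binomial standard error at that sample size. Here is a minimal sketch of that relationship; the function name and constant are illustrative, not taken from this site's code.

```python
# Minimal sketch: binomial standard error of an accuracy estimate as a
# function of the accuracy itself, assuming n = 164 independent problems
# (the size of HumanEval). The SE(A) column below matches this formula
# up to rounding.
import math

N_PROBLEMS = 164  # HumanEval has 164 problems

def binomial_se(accuracy_pct: float, n: int = N_PROBLEMS) -> float:
    """Standard error of a pass@1 estimate, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# Examples: qwen3-32b at 87.5% pass@1 and llama-3.2-3B at 45.7%.
print(round(binomial_se(87.5), 1))  # 2.6, matching SE(A) in the table
print(round(binomial_se(45.7), 1))  # 3.9, matching SE(A) in the table
```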

Figure: CDF of question-level accuracy.
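For readers who want to reproduce a curve like this, here is a sketch of how a question-level accuracy CDF can be computed. The `results` map and its placeholder data are hypothetical; the site computes the real curve from its own per-problem evaluation records.

```python
# Sketch: empirical CDF of question-level accuracy. `results` is a
# hypothetical {model: pass/fail array} map with one entry per HumanEval
# problem; replace the placeholder data with real evaluation records.
import numpy as np

results = {
    "model_a": np.random.rand(164) < 0.8,   # placeholder pass/fail data
    "model_b": np.random.rand(164) < 0.45,  # placeholder pass/fail data
}

# Per-question accuracy, averaged over models.
passes = np.stack(list(results.values()))  # shape (n_models, 164)
question_acc = passes.mean(axis=0)

# Empirical CDF: fraction of questions at or below each accuracy level.
xs = np.sort(question_acc)
cdf = np.arange(1, len(xs) + 1) / len(xs)
for x, y in zip(xs[::40], cdf[::40]):
    print(f"P(question accuracy <= {x:.2f}) = {y:.2f}")
```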

Results table by model

All columns except count are in percentage points.

| model | pass@1 (%) | win_rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---:|---:|---:|---:|---:|---:|
| qwen3-32b | 87.5 | 42.9 | 11 | 2.6 | 1.9 | 1.8 |
| google_gemma_3_27b_it | 86.5 | 41.4 | 12 | 2.7 | 2.5 | 0.79 |
| qwen3-14b | 86 | 41.7 | 12 | 2.7 | 2.3 | 1.5 |
| qwen2.5-coder-32b-instruct | 83.3 | 41 | 11 | 2.9 | 1.9 | 2.2 |
| google_gemma_3_12b_it | 82.8 | 38.7 | 11 | 2.9 | 2.7 | 1.1 |
| qwen2.5-coder-14b-instruct | 82.3 | 39.9 | 12 | 3 | 1.9 | 2.3 |
| qwen3-4b | 78.9 | 36.7 | 12 | 3.2 | 2.7 | 1.7 |
| qwen3-8b | 77.6 | 36.6 | 12 | 3.3 | 2.5 | 2.1 |
| google_gemma_2_27b_it | 75.2 | 33.7 | 10 | 3.4 | 3.1 | 1.4 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 74.2 | 33.4 | 11 | 3.4 | 2.6 | 2.2 |
| qwen2-math-72b-instruct | 70.6 | 31.8 | 11 | 3.6 | 2.4 | 2.6 |
| deepseek_r1_distill_qwen_32b | 70.2 | 33.8 | 11 | 3.6 | 1.8 | 3.1 |
| google_gemma_3_4b_it | 69.6 | 30.5 | 13 | 3.6 | 3.3 | 1.4 |
| llama-3.1-8B-instruct | 65.9 | 28 | 15 | 3.7 | 3.7 | 0 |
| google_gemma_2_9b_it | 63.1 | 26.6 | 11 | 3.8 | 3.5 | 1.4 |
| deepseek_r1_distill_qwen_14b | 58.6 | 27.7 | 11 | 3.8 | 2 | 3.3 |
| qwen2-7b-instruct | 57.7 | 24.7 | 11 | 3.9 | 2.3 | 3.1 |
| qwen3-1.7b | 57.1 | 23.9 | 12 | 3.9 | 3.2 | 2.1 |
| google_codegemma_1.1_7b_it | 56.4 | 22.3 | 13 | 3.9 | 3.1 | 2.3 |
| qwen2-72b-instruct | 55 | 25.6 | 11 | 3.9 | 2 | 3.3 |
| deepseek_r1_distill_llama_70b | 54.4 | 25.1 | 11 | 3.9 | 2 | 3.3 |
| qwen2.5-coder-7b-instruct | 54 | 23.5 | 10 | 3.9 | 2.1 | 3.3 |
| qwen1.5-14b-chat | 52.1 | 20.3 | 12 | 3.9 | 3.1 | 2.4 |
| qwen2.5-coder-3b-instruct | 50 | 21.3 | 12 | 3.9 | 2.2 | 3.2 |
| deepseek_v2_lite_chat | 48.8 | 19.2 | 11 | 3.9 | 2.7 | 2.8 |
| qwen1.5-32b-chat | 48.2 | 19.1 | 11 | 3.9 | 3.2 | 2.3 |
| mistralai_ministral_8b_instruct_2410 | 46.1 | 19.2 | 11 | 3.9 | 2.3 | 3.1 |
| llama-3.2-3B-instruct | 45.7 | 16.9 | 17 | 3.9 | 3.9 | 0 |
| mistralai_mathstral_7b_v0.1 | 44 | 17.1 | 11 | 3.9 | 2.3 | 3.1 |
| qwen2.5-coder-1.5b-instruct | 43.1 | 16.6 | 11 | 3.9 | 2.3 | 3.1 |
| deepseek_r1_distill_llama_8b | 42.6 | 18.1 | 13 | 3.9 | 2.2 | 3.2 |
| qwen1.5-72b-chat | 42.6 | 16.6 | 11 | 3.9 | 3 | 2.5 |
| mistralai_mistral_7b_instruct_v0.3 | 41.6 | 14.8 | 11 | 3.8 | 3.1 | 2.2 |
| google_gemma_3_1b_it | 41.3 | 15.4 | 13 | 3.8 | 3.6 | 1.4 |
| qwen1.5-7b-chat | 39.8 | 14.4 | 12 | 3.8 | 2.9 | 2.4 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 38.7 | 14.9 | 12 | 3.8 | 2.8 | 2.6 |
| qwen2.5-coder-0.5b-instruct | 36.4 | 13.6 | 13 | 3.8 | 2.3 | 3 |
| qwen2-math-7b-instruct | 32.2 | 12.4 | 6 | 3.6 | 2.4 | 2.8 |
| deepseek_r1_distill_qwen_7b | 31.8 | 13 | 11 | 3.6 | 2 | 3 |
| llama-3.2-1B-instruct | 30.5 | 10.6 | 12 | 3.6 | 3.6 | 0 |
| mistralai_mistral_7b_instruct_v0.1 | 28.3 | 9.4 | 11 | 3.5 | 2.4 | 2.6 |
| google_gemma_7b_it | 26.3 | 8.58 | 13 | 3.4 | 2.9 | 1.9 |
| qwen3-0.6b | 25.1 | 8.03 | 13 | 3.4 | 2.5 | 2.3 |
| google_gemma_2b_it | 17 | 4.73 | 13 | 2.9 | 2.4 | 1.6 |
| qwen2-1.5b-instruct | 16.4 | 6.09 | 13 | 2.9 | 1.2 | 2.6 |
| qwen2-0.5b-instruct | 10.3 | 2.82 | 13 | 2.4 | 1.4 | 1.9 |
| mistralai_mistral_7b_instruct_v0.2 | 9.15 | 2.69 | 10 | 2.3 | 1.2 | 1.9 |
| qwen1.5-1.8b-chat | 7.43 | 2.09 | 11 | 2 | 1.1 | 1.7 |
| qwen2-math-1.5b-instruct | 3.96 | 1.45 | 4 | 1.5 | 0.98 | 1.2 |
| qwen1.5-0.5b-chat | 2.91 | 0.517 | 13 | 1.3 | 0.75 | 1.1 |
| deepseek_r1_distill_qwen_1.5b | 2.67 | 0.804 | 13 | 1.3 | 0.39 | 1.2 |
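One pattern worth noting (an observation from the numbers, not something stated on this page): the three SE columns are consistent, up to rounding, with the decomposition SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2. A quick check on a few rows:

```python
# Sanity check (an observation, not documented on this page): the three
# SE columns appear to decompose as SE(A)^2 ~ SE_x(A)^2 + SE_pred(A)^2,
# up to the table's rounding. A few rows from above:
import math

rows = [  # (model, SE(A), SE_x(A), SE_pred(A))
    ("qwen3-32b",                    2.6, 1.9,  1.8),
    ("deepseek_r1_distill_qwen_32b", 3.6, 1.8,  3.1),
    ("llama-3.1-8B-instruct",        3.7, 3.7,  0.0),
    ("qwen1.5-0.5b-chat",            1.3, 0.75, 1.1),
]
for model, se, se_x, se_pred in rows:
    implied = math.sqrt(max(se**2 - se_x**2, 0.0))  # guard rounding noise
    print(f"{model}: reported SE_pred={se_pred}, implied {implied:.2f}")
```

Consistent with this reading, the rows where SE_x(A) equals SE(A) (the three Llama models above) report SE_pred(A) = 0.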