DS-1000: by models



Figure: SE predicted by accuracy — the typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
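The SE(A) column below is consistent with the usual binomial standard error of an accuracy estimated over the 1,000 DS-1000 questions. A minimal sketch (assuming n = 1000 independent pass/fail questions; the function name is illustrative):

```python
import math

def binomial_se(acc_pct: float, n: int = 1000) -> float:
    """Standard error, in percentage points, of an accuracy
    estimated from n independent pass/fail questions."""
    p = acc_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# e.g. an accuracy of 38.6% over 1000 questions
print(round(binomial_se(38.6), 2))  # ~1.5, matching SE(A) for the top row
```

As accuracy approaches 0% or 100% this error shrinks, which is why the SE(A) values fall off sharply for the weakest models in the table.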

Figure: CDF of question-level accuracy.
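A CDF like the one above can be built by sorting the per-question accuracies and plotting cumulative fractions. A minimal NumPy sketch (the accuracy values here are illustrative, not taken from the dataset):

```python
import numpy as np

def ecdf(values):
    """Empirical CDF: returns sorted values and cumulative fractions."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# illustrative per-question accuracies (e.g. fraction of runs solving each question)
acc = [0.0, 0.1, 0.1, 0.4, 0.8]
x, y = ecdf(acc)
```

Plotting `y` against `x` as a step function gives the CDF; a large mass near zero would indicate many questions that almost no model solves.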

Results table by model

| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| qwen2.5-coder-14b-instruct | 38.6 | 29 | 3 | 1.5 | 1.2 | 1 |
| qwen3-14b | 37 | 27.3 | 3 | 1.5 | 1.3 | 0.84 |
| google_gemma_3_12b_it | 32.2 | 22.9 | 4 | 1.5 | 1.3 | 0.67 |
| qwen3-8b | 30 | 21.3 | 3 | 1.4 | 1.2 | 0.88 |
| qwen3-4b | 28.5 | 19.9 | 4 | 1.4 | 1.2 | 0.85 |
| qwen2.5-coder-7b-instruct | 28.5 | 20.4 | 4 | 1.4 | 1 | 0.99 |
| qwen2-7b-instruct | 21.2 | 14.1 | 4 | 1.3 | 0.97 | 0.85 |
| google_gemma_3_4b_it | 19.2 | 12.3 | 5 | 1.2 | 1.1 | 0.6 |
| qwen1.5-14b-chat | 18.8 | 12.3 | 3 | 1.2 | 0.94 | 0.8 |
| google_codegemma_1.1_7b_it | 18.6 | 12.2 | 5 | 1.2 | 0.9 | 0.84 |
| mistralai_ministral_8b_instruct_2410 | 17.7 | 11.4 | 4 | 1.2 | 0.81 | 0.9 |
| qwen2.5-coder-3b-instruct | 16.8 | 10.9 | 4 | 1.2 | 0.76 | 0.91 |
| mistralai_mathstral_7b_v0.1 | 16.8 | 10.7 | 4 | 1.2 | 0.81 | 0.86 |
| mistralai_mistral_7b_instruct_v0.3 | 16.5 | 10.6 | 4 | 1.2 | 0.86 | 0.8 |
| qwen3-1.7b | 14.6 | 8.76 | 4 | 1.1 | 0.89 | 0.68 |
| deepseek_r1_distill_qwen_14b | 14 | 9.24 | 4 | 1.1 | 0.69 | 0.85 |
| llama-3.1-8B-instruct | 13.9 | 8.71 | 7 | 1.1 | 1.1 | 0.085 |
| deepseek_v2_lite_chat | 13.7 | 8.4 | 3 | 1.1 | 0.77 | 0.77 |
| qwen2-math-7b-instruct | 13.4 | 8.58 | 2 | 1.1 | 0.68 | 0.83 |
| deepseek_r1_distill_qwen_7b | 11.8 | 7.34 | 4 | 1 | 0.66 | 0.78 |
| mistralai_mistral_7b_instruct_v0.2 | 11.7 | 7.17 | 4 | 1 | 0.72 | 0.71 |
| google_gemma_2_9b_it | 10.5 | 6.19 | 3 | 0.97 | 0.78 | 0.58 |
| qwen2.5-coder-1.5b-instruct | 8.55 | 5.16 | 4 | 0.88 | 0.47 | 0.75 |
| qwen1.5-7b-chat | 6.77 | 3.9 | 3 | 0.79 | 0.5 | 0.62 |
| llama-3.2-3B-instruct | 6.31 | 3.39 | 10 | 0.77 | 0.77 | 0.075 |
| qwen3-0.6b | 5.92 | 3.14 | 5 | 0.75 | 0.52 | 0.53 |
| mistralai_mistral_7b_instruct_v0.1 | 5.83 | 3.1 | 4 | 0.74 | 0.46 | 0.58 |
| deepseek_r1_distill_llama_8b | 4.83 | 2.98 | 4 | 0.68 | 0.39 | 0.55 |
| google_gemma_7b_it | 4.6 | 2.22 | 4 | 0.66 | 0.56 | 0.35 |
| google_gemma_3_1b_it | 3.6 | 1.74 | 4 | 0.59 | 0.47 | 0.36 |
| qwen2.5-coder-0.5b-instruct | 2.96 | 1.49 | 5 | 0.54 | 0.28 | 0.45 |
| qwen2-1.5b-instruct | 2.93 | 1.51 | 4 | 0.53 | 0.24 | 0.48 |
| llama-3.2-1B-instruct | 1.49 | 0.786 | 13 | 0.38 | 0.38 | 0.028 |
| deepseek_r1_distill_qwen_1.5b | 1.23 | 0.73 | 4 | 0.35 | 0.2 | 0.29 |
| qwen2-math-1.5b-instruct | 1.13 | 0.606 | 3 | 0.33 | 0.16 | 0.29 |
| qwen2-0.5b-instruct | 1.12 | 0.447 | 5 | 0.33 | 0.17 | 0.28 |
| qwen1.5-1.8b-chat | 0.533 | 0.258 | 3 | 0.23 | 0.099 | 0.21 |
| google_gemma_2b_it | 0.45 | 0.197 | 4 | 0.21 | 0.038 | 0.21 |
| qwen1.5-0.5b-chat | 0.06 | 0.0376 | 5 | 0.077 | 0 | 0.077 |