ds1000: by model



Figure: SE predicted by accuracy — the typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
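The rough shape of the curve above can be sketched with a simple binomial-noise model. Under the assumption that each of the benchmark's questions is an independent Bernoulli trial (DS-1000 has 1000 problems), the standard error of an accuracy estimate follows from the binomial variance; the page's SE_x and SE_pred columns presumably come from a more detailed model, so this is an illustrative approximation, not the site's actual computation.

```python
import math

def binomial_se(accuracy_pct, n_questions=1000):
    """Standard error (in percentage points) of an accuracy estimate,
    treating each question as an independent Bernoulli trial.
    n_questions=1000 matches the DS-1000 benchmark size (assumption:
    all questions are weighted equally and scored independently)."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)
```

For a model at 38.6% accuracy this gives an SE of about 1.5 percentage points, consistent with the SE(A) column in the table below; for the difference between two independent models, the pairwise SE would be the root sum of squares of their individual SEs.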

Figure: CDF of question-level accuracy.

Results table by model

Pass@1, pass@count, and win_rate are percentages; the SE columns are in percentage points.

| model | pass@1 | pass@count | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|---|
| qwen2.5-coder-14b-instruct | 38.6 | 54 | 29 | 3 | 1.5 | 1.2 | 1 |
| qwen3-14b | 37 | 47.9 | 27.3 | 3 | 1.5 | 1.3 | 0.84 |
| google_gemma_3_12b_it | 32.2 | 40.3 | 22.9 | 4 | 1.5 | 1.3 | 0.67 |
| qwen3-8b | 30 | 41.5 | 21.3 | 3 | 1.4 | 1.2 | 0.88 |
| qwen3-4b | 28.5 | 42.6 | 19.9 | 4 | 1.4 | 1.2 | 0.85 |
| qwen2.5-coder-7b-instruct | 28.5 | 46.2 | 20.4 | 4 | 1.4 | 1 | 0.99 |
| qwen2-7b-instruct | 21.2 | 35 | 14.1 | 4 | 1.3 | 0.97 | 0.85 |
| google_gemma_3_4b_it | 19.2 | 27.7 | 12.3 | 5 | 1.2 | 1.1 | 0.6 |
| qwen1.5-14b-chat | 18.8 | 28.9 | 12.3 | 3 | 1.2 | 0.94 | 0.8 |
| google_codegemma_1.1_7b_it | 18.6 | 35.6 | 12.2 | 5 | 1.2 | 0.9 | 0.84 |
| mistralai_ministral_8b_instruct_2410 | 17.7 | 34.7 | 11.4 | 4 | 1.2 | 0.81 | 0.9 |
| qwen2.5-coder-3b-instruct | 16.8 | 33.2 | 10.9 | 4 | 1.2 | 0.76 | 0.91 |
| mistralai_mathstral_7b_v0.1 | 16.8 | 32.2 | 10.7 | 4 | 1.2 | 0.81 | 0.86 |
| mistralai_mistral_7b_instruct_v0.3 | 16.5 | 29.3 | 10.6 | 4 | 1.2 | 0.86 | 0.8 |
| qwen3-1.7b | 14.6 | 24.1 | 8.76 | 4 | 1.1 | 0.89 | 0.68 |
| deepseek_r1_distill_qwen_14b | 14 | 29.8 | 9.24 | 4 | 1.1 | 0.69 | 0.85 |
| llama-3.1-8B-instruct | 13.9 | 14 | 8.71 | 7 | 1.1 | 1.1 | 0.085 |
| deepseek_v2_lite_chat | 13.7 | 23.1 | 8.4 | 3 | 1.1 | 0.77 | 0.77 |
| qwen2-math-7b-instruct | 13.4 | 20.3 | 8.58 | 2 | 1.1 | 0.68 | 0.83 |
| deepseek_r1_distill_qwen_7b | 11.8 | 24.8 | 7.34 | 4 | 1 | 0.66 | 0.78 |
| mistralai_mistral_7b_instruct_v0.2 | 11.7 | 22.5 | 7.17 | 4 | 1 | 0.72 | 0.71 |
| google_gemma_2_9b_it | 10.5 | 16.4 | 6.19 | 3 | 0.97 | 0.78 | 0.58 |
| qwen2.5-coder-1.5b-instruct | 8.55 | 21.1 | 5.16 | 4 | 0.88 | 0.47 | 0.75 |
| qwen1.5-7b-chat | 6.77 | 13.3 | 3.9 | 3 | 0.79 | 0.5 | 0.62 |
| llama-3.2-3B-instruct | 6.31 | 6.4 | 3.39 | 10 | 0.77 | 0.77 | 0.075 |
| qwen3-0.6b | 5.92 | 13.5 | 3.14 | 5 | 0.75 | 0.52 | 0.53 |
| mistralai_mistral_7b_instruct_v0.1 | 5.83 | 13.4 | 3.1 | 4 | 0.74 | 0.46 | 0.58 |
| deepseek_r1_distill_llama_8b | 4.83 | 12 | 2.98 | 4 | 0.68 | 0.39 | 0.55 |
| google_gemma_7b_it | 4.6 | 7.2 | 2.22 | 4 | 0.66 | 0.56 | 0.35 |
| google_gemma_3_1b_it | 3.6 | 6.4 | 1.74 | 4 | 0.59 | 0.47 | 0.36 |
| qwen2.5-coder-0.5b-instruct | 2.96 | 8.9 | 1.49 | 5 | 0.54 | 0.28 | 0.45 |
| qwen2-1.5b-instruct | 2.93 | 8.6 | 1.51 | 4 | 0.53 | 0.24 | 0.48 |
| llama-3.2-1B-instruct | 1.49 | 1.5 | 0.786 | 13 | 0.38 | 0.38 | 0.028 |
| deepseek_r1_distill_qwen_1.5b | 1.23 | 3.2 | 0.73 | 4 | 0.35 | 0.2 | 0.29 |
| qwen2-math-1.5b-instruct | 1.13 | 2.7 | 0.606 | 3 | 0.33 | 0.16 | 0.29 |
| qwen2-0.5b-instruct | 1.12 | 3.6 | 0.447 | 5 | 0.33 | 0.17 | 0.28 |
| qwen1.5-1.8b-chat | 0.533 | 1.3 | 0.258 | 3 | 0.23 | 0.099 | 0.21 |
| google_gemma_2b_it | 0.45 | 1.7 | 0.197 | 4 | 0.21 | 0.038 | 0.21 |
| qwen1.5-0.5b-chat | 0.06 | 0.3 | 0.0376 | 5 | 0.077 | 0 | 0.077 |
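The pass@count column reports the pass rate when all of a model's samples per question are pooled. A common way to estimate pass@k from n samples with c successes is the unbiased estimator of Chen et al. (2021); whether this page uses that exact estimator is an assumption, so the sketch below is illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn without replacement from n generations falls among
    the c correct ones, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer incorrect samples than k draws: success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 4 samples and c = 1 correct, pass@1 is 0.25 while pass@4 is 1.0, which is why pass@count always upper-bounds pass@1 in the table.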