DS-1000: results by model



SE predicted by accuracy

Typical standard errors between pairs of models on this dataset, plotted as a function of absolute accuracy.
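A standard error predicted from accuracy alone can be sketched with the binomial formula, assuming independent questions and n = 1000 (the size of DS-1000); the exact predictor used for the plot is not specified here, but the SE(A) column below is consistent with this formula:

```python
import math

def predicted_se(accuracy_pct, n_questions=1000):
    """Binomial standard error of an accuracy, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

# e.g. a model with 43.2% pass@1 over 1000 questions:
se = predicted_se(43.2)  # about 1.57 percentage points
```

This matches the table: a 43.2% model shows SE(A) = 1.6, and a 9.23% model shows SE(A) = 0.92.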

CDF of question-level accuracy
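The empirical CDF behind such a plot can be sketched as follows, assuming per-question pass rates are available as a list (the data behind the actual figure is not included here; the values in the example are hypothetical):

```python
import numpy as np

def empirical_cdf(per_question_accuracy):
    """Return (x, y) points of the empirical CDF of per-question pass rates."""
    x = np.sort(np.asarray(per_question_accuracy, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)  # fraction of questions at or below each rate
    return x, y

# hypothetical per-question pass rates for one model
x, y = empirical_cdf([0.0, 0.25, 0.5, 0.5, 1.0])
```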

Results table by model

| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| qwen2.5-coder-14b-instruct | 43.2 | 30.9 | 12 | 1.6 | 1.3 | 0.88 |
| qwen3-14b | 38.0 | 26.2 | 12 | 1.5 | 1.3 | 0.77 |
| qwen2.5-coder-7b-instruct | 35.7 | 24.5 | 10 | 1.5 | 1.2 | 0.92 |
| google_gemma_3_12b_it | 32.8 | 21.5 | 11 | 1.5 | 1.4 | 0.57 |
| qwen3-8b | 31.2 | 20.3 | 12 | 1.5 | 1.2 | 0.79 |
| qwen3-4b | 28.9 | 18.5 | 12 | 1.4 | 1.2 | 0.79 |
| mistralai_ministral_8b_instruct_2410 | 24.7 | 15.3 | 11 | 1.4 | 1.0 | 0.90 |
| qwen2.5-coder-3b-instruct | 23.8 | 14.6 | 12 | 1.3 | 1.0 | 0.89 |
| qwen2-7b-instruct | 23.6 | 14.4 | 11 | 1.3 | 1.1 | 0.79 |
| mistralai_mathstral_7b_v0.1 | 22.6 | 13.7 | 11 | 1.3 | 0.97 | 0.89 |
| llama-3.1-8B-instruct | 22.3 | 13.6 | 15 | 1.3 | 1.3 | 0.15 |
| google_codegemma_1.1_7b_it | 20.9 | 12.7 | 13 | 1.3 | 0.97 | 0.84 |
| qwen1.5-14b-chat | 20.5 | 12.3 | 12 | 1.3 | 1.0 | 0.77 |
| qwen2-math-7b-instruct | 20.2 | 12.2 | 6 | 1.3 | 1.0 | 0.79 |
| google_gemma_3_4b_it | 19.5 | 11.2 | 13 | 1.3 | 1.1 | 0.56 |
| mistralai_mistral_7b_instruct_v0.3 | 18.2 | 10.5 | 11 | 1.2 | 0.98 | 0.73 |
| deepseek_v2_lite_chat | 17.4 | 10.2 | 11 | 1.2 | 0.88 | 0.81 |
| deepseek_r1_distill_qwen_14b | 16.1 | 9.82 | 11 | 1.2 | 0.82 | 0.83 |
| qwen3-1.7b | 15.2 | 8.08 | 12 | 1.1 | 0.97 | 0.58 |
| deepseek_r1_distill_qwen_7b | 14.8 | 8.42 | 11 | 1.1 | 0.81 | 0.78 |
| qwen2.5-coder-1.5b-instruct | 14.1 | 8.14 | 11 | 1.1 | 0.70 | 0.85 |
| mistralai_mistral_7b_instruct_v0.2 | 13.6 | 7.55 | 10 | 1.1 | 0.85 | 0.67 |
| google_gemma_2_9b_it | 10.6 | 5.60 | 12 | 0.97 | 0.83 | 0.51 |
| llama-3.2-3B-instruct | 10.1 | 5.18 | 17 | 0.95 | 0.95 | 0.072 |
| mistralai_mistral_7b_instruct_v0.1 | 9.23 | 4.58 | 11 | 0.92 | 0.65 | 0.65 |
| qwen1.5-7b-chat | 8.67 | 4.54 | 12 | 0.89 | 0.62 | 0.64 |
| qwen3-0.6b | 7.42 | 3.48 | 13 | 0.83 | 0.63 | 0.54 |
| deepseek_r1_distill_llama_8b | 6.85 | 3.71 | 13 | 0.80 | 0.54 | 0.59 |
| qwen2-1.5b-instruct | 5.78 | 2.68 | 13 | 0.74 | 0.47 | 0.57 |
| google_gemma_7b_it | 4.75 | 1.94 | 13 | 0.67 | 0.60 | 0.30 |
| llama-3.2-1B-instruct | 4.58 | 2.16 | 12 | 0.66 | 0.66 | 0.039 |
| qwen2.5-coder-0.5b-instruct | 4.55 | 2.07 | 13 | 0.66 | 0.38 | 0.53 |
| qwen2-math-1.5b-instruct | 4.08 | 2.21 | 4 | 0.63 | 0.34 | 0.52 |
| google_gemma_3_1b_it | 3.83 | 1.65 | 13 | 0.61 | 0.50 | 0.34 |
| qwen2-0.5b-instruct | 2.20 | 0.906 | 13 | 0.46 | 0.27 | 0.37 |
| deepseek_r1_distill_qwen_1.5b | 1.78 | 0.971 | 13 | 0.42 | 0.22 | 0.35 |
| qwen1.5-1.8b-chat | 1.37 | 0.60 | 12 | 0.37 | 0.17 | 0.32 |
| google_gemma_2b_it | 0.385 | 0.17 | 13 | 0.20 | 0.096 | 0.17 |
| qwen1.5-0.5b-chat | 0.354 | 0.13 | 13 | 0.19 | 0.071 | 0.17 |