SAFIM: results by model



Figure: SE predicted by accuracy

Typical standard errors for comparisons between pairs of models on this dataset, shown as a function of absolute accuracy.
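One simple reading of this relationship (a hypothetical model, not stated on this page) is a binomial one: a model's accuracy over n independent questions has standard error sqrt(p(1-p)/n), which peaks at p = 0.5 and shrinks toward either extreme, and the SE of the gap between two independent models is the root-sum-square of the two SEs. A minimal sketch, where `accuracy_se` and `pairwise_se` are illustrative helpers, not functions from the benchmark's code:

```python
import math

def accuracy_se(p: float, n: int) -> float:
    """Binomial standard error of an accuracy p estimated over n questions."""
    return math.sqrt(p * (1.0 - p) / n)

def pairwise_se(p1: float, p2: float, n: int) -> float:
    """SE of the accuracy difference between two independent models."""
    return math.sqrt(accuracy_se(p1, n) ** 2 + accuracy_se(p2, n) ** 2)

# The SE is widest at 50% accuracy and narrows toward 0% or 100%,
# which is why it varies with absolute accuracy.
se_mid = accuracy_se(0.5, 1000)
se_high = accuracy_se(0.9, 1000)
```

In practice question outcomes are correlated across models, so a paired comparison on the same questions typically has a smaller SE than this independence assumption suggests.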

Figure: CDF of question-level accuracy

Results table by model

| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---:|---:|---:|---:|---:|---:|
| deepseek-coder-33b | 66.5 | 30.4 | 1 | 0.36 | NaN | NaN |
| deepseek-coder-6.7b | 60.9 | 25.7 | 1 | 0.37 | NaN | NaN |
| wizardcoder-33b | 58.4 | 25.2 | 1 | 0.37 | NaN | NaN |
| codellama-13b | 50 | 18.5 | 1 | 0.38 | NaN | NaN |
| starcoderbase-16b | 50 | 17.6 | 1 | 0.38 | NaN | NaN |
| gpt-4-1106-preview | 49.2 | 22.6 | 1 | 0.38 | NaN | NaN |
| deepseek-coder-1.3b | 48.3 | 17 | 1 | 0.38 | NaN | NaN |
| wizardcoder-15b | 47.6 | 16.3 | 1 | 0.38 | NaN | NaN |
| codellama-34b | 46.9 | 16.6 | 1 | 0.38 | NaN | NaN |
| codellama-7b | 44.7 | 15.6 | 1 | 0.38 | NaN | NaN |
| mixtral-8x7b | 42.7 | 14.2 | 1 | 0.37 | NaN | NaN |
| wizardcoder-3b | 41.1 | 12.6 | 1 | 0.37 | NaN | NaN |
| gpt-3.5-turbo-0301 | 35.1 | 13.2 | 1 | 0.36 | NaN | NaN |
| wizardcoder-1b | 34.7 | 9.37 | 1 | 0.36 | NaN | NaN |
| codegen-16b | 31.2 | 7.36 | 1 | 0.35 | NaN | NaN |
| phi-2 | 29.5 | 6.83 | 1 | 0.35 | NaN | NaN |
| codegen-6b | 29.5 | 6.71 | 1 | 0.35 | NaN | NaN |
| codegen-2b | 28.6 | 6.31 | 1 | 0.34 | NaN | NaN |
| incoder-6b | 27.3 | 7.68 | 1 | 0.34 | NaN | NaN |
| phi-1.5 | 24.8 | 5.18 | 1 | 0.33 | NaN | NaN |
| incoder-1b | 22.6 | 6.1 | 1 | 0.32 | NaN | NaN |
| codegen-350m | 21.5 | 4.35 | 1 | 0.31 | NaN | NaN |
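The SE(A) column is consistent with a binomial standard error of pass@1, expressed in percentage points. A hypothetical sanity check, where `se_percent` is an illustrative helper and the dataset size `n` is an assumption inferred by fitting the table (not stated on this page):

```python
import math

def se_percent(pass1_pct: float, n: int) -> float:
    """Binomial SE of pass@1, with both input and output in percentage points.

    n is an ASSUMED number of questions; this page does not state it.
    """
    p = pass1_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# With n around 17,000, se_percent(50) is roughly 0.38 and
# se_percent(66.5) is roughly 0.36, matching the SE(A) column above.
```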