CRUXEval-output-T0.8: results by model

SE predicted by accuracy

Figure: typical standard error of the accuracy difference between pairs of models on this dataset, as a function of absolute accuracy.
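For reference, the simplest accuracy-only prediction of a standard error is the binomial formula; this is an assumption on my part, not necessarily the estimator behind the figure or the SE_pred(A) column, but for CRUXEval's 800 questions it already reproduces the magnitude of the SE(A) column in the table below.

```python
import math

def binomial_se(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy estimate (accuracy in [0, 1])."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)

def diff_se_unpaired(acc_a: float, acc_b: float, n_questions: int) -> float:
    """SE of the accuracy gap between two models, assuming independent errors.

    A paired comparison on the same questions is typically tighter; this
    independence assumption gives an upper bound.
    """
    return math.hypot(binomial_se(acc_a, n_questions),
                      binomial_se(acc_b, n_questions))

# CRUXEval has 800 questions; at 40% accuracy the binomial SE is ~1.7 points,
# matching the SE(A) column in the results table below.
print(binomial_se(0.40, 800) * 100)             # ~1.73
print(diff_se_unpaired(0.40, 0.45, 800) * 100)  # ~2.47
```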

CDF of question-level accuracy
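Here a question's accuracy would be its pass rate over the 10 samples drawn at temperature 0.8. A minimal sketch of the empirical CDF, using a hypothetical pass/fail matrix in place of the real results:

```python
import numpy as np

# Hypothetical input: passed[i, j] = 1 if sample j for question i passed.
# Shape (n_questions, n_samples); random data stands in for real results.
rng = np.random.default_rng(0)
passed = rng.integers(0, 2, size=(800, 10))

question_acc = passed.mean(axis=1)        # per-question pass rate

# Empirical CDF: fraction of questions with accuracy <= x.
xs = np.sort(question_acc)
cdf = np.arange(1, len(xs) + 1) / len(xs)

for x, p in zip(xs[::200], cdf[::200]):   # print a few points on the curve
    print(f"P(question accuracy <= {x:.1f}) = {p:.2f}")
```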

Results table by model

model                   pass@1 (%)  win rate (%)  count  SE(A)  SE_x(A)  SE_pred(A)
gpt-4-0613+cot                77.2            42     10    1.5      1.2        0.86
gpt-4-0613                      68          33.8     10    1.6      1.6        0.53
gpt-3.5-turbo-0613+cot        56.5          25.7     10    1.8      1.3         1.1
deepseek-instruct-33b         48.4          18.3     10    1.8      1.5        0.88
gpt-3.5-turbo-0613            47.5            18     10    1.8      1.6        0.82
deepseek-base-33b             45.5          16.1     10    1.8      1.5        0.96
codetulu-2-34b                43.8          14.9     10    1.8      1.5        0.92
codellama-34b+cot             42.8          16.5     10    1.7      1.2         1.3
magicoder-ds-7b               41.7          13.8     10    1.7      1.5        0.94
wizard-34b                    41.4          13.3     10    1.7      1.5        0.84
deepseek-instruct-6.7b        40.5          12.9     10    1.7      1.5        0.83
codellama-python-34b          39.7          12.3     10    1.7      1.5        0.87
codellama-34b                 39.3          12.3     10    1.7      1.4        0.97
phind                         38.9          12.2     10    1.7      1.5        0.89
deepseek-base-6.7b            38.3          11.8     10    1.7      1.4        0.97
wizard-13b                    36.9          10.9     10    1.7      1.4        0.93
codellama-python-13b          36.4          10.7     10    1.7      1.4        0.96
mixtral-8x7b                  36.3          10.8     10    1.7      1.4        0.99
codellama-13b                 36.1          10.5     10    1.7      1.4        0.98
codellama-13b+cot             34.9          12.1     10    1.7      1.1         1.3
codellama-python-7b           32.4          8.91     10    1.7      1.4        0.95
codellama-7b                  30.9          8.22     10    1.6      1.3        0.98
starcoderbase-16b             30.7          8.17     10    1.6      1.3        0.94
mistral-7b                    30.1          8.19     10    1.6      1.3           1
phi-2                         29.7          8.18     10    1.6      1.3        0.96
codellama-7b+cot              29.1          8.99     10    1.6      1.1         1.2
starcoderbase-7b              28.9           7.3     10    1.6      1.3        0.93
deepseek-instruct-1.3b        27.4          7.41     10    1.6      1.3        0.83
deepseek-base-1.3b            25.9          6.94     10    1.5      1.2        0.99
phi-1.5                       21.7          6.39     10    1.5      1.1        0.96
phi-1                         19.3          5.43     10    1.4      1.1        0.82
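Relatedly, the standard error between a pair of models (as in the figure above) can be estimated with the textbook paired formula when both models are graded on the same questions. The sketch below uses hypothetical 0/1 per-question scores; the exact estimators behind SE_x(A) and SE_pred(A) are defined in the accompanying paper.

```python
import numpy as np

def paired_diff_se(scores_a: np.ndarray, scores_b: np.ndarray) -> float:
    """Standard error of mean(scores_a - scores_b) over shared questions.

    Pairing on the same questions removes shared question-difficulty
    variance that would inflate an unpaired comparison.
    """
    diff = scores_a.astype(float) - scores_b.astype(float)
    return diff.std(ddof=1) / np.sqrt(len(diff))

# Hypothetical per-question 0/1 scores for two models on 800 questions.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, 800)
b = rng.integers(0, 2, 800)
print(f"gap = {(a.mean() - b.mean()) * 100:+.2f} pts, "
      f"SE = {paired_diff_se(a, b) * 100:.2f} pts")
```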