CRUXEval-input-T0.2: by models

SE predicted by accuracy

[Figure] The typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.
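
For intuition, here is a minimal sketch of how a standard error can be predicted from accuracy alone, assuming each of CRUXEval's 800 problems is an independent Bernoulli trial (the figure's pairwise SEs additionally depend on how strongly two models' errors correlate across questions, which this simple model ignores):

```python
import math

def binomial_se(accuracy: float, n_questions: int = 800) -> float:
    """Standard error of accuracy under a simple binomial model:
    sqrt(A * (1 - A) / N)."""
    return math.sqrt(accuracy * (1 - accuracy) / n_questions)

# SE peaks at 50% accuracy and shrinks toward the extremes.
for acc in (0.131, 0.35, 0.50, 0.65, 0.757):
    print(f"accuracy={acc:.1%}  SE={100 * binomial_se(acc):.2f} pp")
```

These values track the SE(A) column in the table below: roughly 1.8 percentage points near 50% accuracy, 1.5 near 76%, and 1.2 near 13%.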

CDF of question-level accuracy

[Figure] Empirical CDF of question-level accuracy.
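
A sketch of how such a curve can be computed, assuming a per-question boolean matrix of pass/fail outcomes (the array name and shape here are illustrative, not the paper's API):

```python
import numpy as np

def question_level_cdf(correct: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Empirical CDF of per-question accuracy.

    `correct` is a (n_questions, n_samples) boolean array:
    correct[i, j] is True if sample j solved question i.
    """
    per_question = correct.mean(axis=1)        # accuracy of each question
    xs = np.sort(per_question)                 # CDF support
    ys = np.arange(1, len(xs) + 1) / len(xs)   # cumulative fraction
    return xs, ys

# Toy input: 800 questions x 10 samples of random outcomes.
rng = np.random.default_rng(0)
xs, ys = question_level_cdf(rng.random((800, 10)) < 0.47)
```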

Results table by model

pass@count is pass@k evaluated at k = count, the number of samples drawn per problem; pass rates are percentages. NaN marks entries that could not be estimated from the available samples.

| model | pass@1 | pass@count | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|---|
| gpt-4-turbo-2024-04-09+cot | 75.7 | 85.5 | 36.1 | 2.7 | 1.5 | NaN | NaN |
| gpt-4o+cot | 75.6 | 83.6 | 35.5 | 2.7 | 1.5 | NaN | NaN |
| gpt-4-0613+cot | 75.5 | 91.6 | 35.4 | 10 | 1.5 | 1.2 | 0.93 |
| claude-3-opus-20240229+cot | 73.4 | 73.4 | 34.4 | 1 | 1.6 | NaN | NaN |
| gpt-4-0613 | 69.8 | 74.2 | 30.7 | 10 | 1.6 | 1.6 | 0.45 |
| gpt-4-turbo-2024-04-09 | 68.5 | 71.9 | 29.5 | 3 | 1.6 | 1.6 | 0.53 |
| gpt-4o | 65.1 | 68.1 | 27.4 | 3 | 1.7 | 1.6 | 0.52 |
| claude-3-opus-20240229 | 64.2 | 64.2 | 27.6 | 1 | 1.7 | NaN | NaN |
| gpt-3.5-turbo-0613+cot | 50.3 | 77.8 | 19.7 | 10 | 1.8 | 1.3 | 1.2 |
| codellama-34b+cot | 50.1 | 70.5 | 18.8 | 10 | 1.8 | 1.5 | 0.98 |
| codetulu-2-34b | 49.2 | 60.5 | 16.5 | 10 | 1.8 | 1.6 | 0.73 |
| gpt-3.5-turbo-0613 | 49 | 57 | 17.4 | 10 | 1.8 | 1.7 | 0.58 |
| codellama-13b+cot | 47.4 | 65.8 | 16.9 | 10 | 1.8 | 1.5 | 0.89 |
| codellama-34b | 47.2 | 59.2 | 15.3 | 10 | 1.8 | 1.6 | 0.74 |
| phind | 47.2 | 56.1 | 16 | 10 | 1.8 | 1.6 | 0.65 |
| deepseek-base-33b | 46.5 | 58 | 14.9 | 10 | 1.8 | 1.6 | 0.75 |
| deepseek-instruct-33b | 46.5 | 57 | 15.5 | 10 | 1.8 | 1.6 | 0.69 |
| codellama-python-34b | 43.9 | 56.4 | 13.9 | 10 | 1.8 | 1.6 | 0.74 |
| wizard-34b | 42.7 | 51.5 | 13.7 | 10 | 1.7 | 1.6 | 0.63 |
| codellama-13b | 42.5 | 54.9 | 12.9 | 10 | 1.7 | 1.6 | 0.8 |
| deepseek-base-6.7b | 41.9 | 54.4 | 12.8 | 10 | 1.7 | 1.6 | 0.74 |
| magicoder-ds-7b | 41.7 | 51.1 | 12.6 | 10 | 1.7 | 1.6 | 0.66 |
| codellama-7b+cot | 40.4 | 62 | 14.1 | 10 | 1.7 | 1.4 | 1 |
| codellama-python-13b | 39.7 | 53.1 | 11.6 | 10 | 1.7 | 1.5 | 0.79 |
| mixtral-8x7b | 39.3 | 53.1 | 12 | 10 | 1.7 | 1.5 | 0.79 |
| deepseek-instruct-6.7b | 37.4 | 46 | 10.8 | 10 | 1.7 | 1.6 | 0.64 |
| codellama-python-7b | 37.3 | 48.4 | 10.9 | 10 | 1.7 | 1.6 | 0.69 |
| wizard-13b | 36.5 | 45.9 | 10.4 | 10 | 1.7 | 1.6 | 0.63 |
| codellama-7b | 36 | 47.5 | 10.1 | 10 | 1.7 | 1.5 | 0.72 |
| mistral-7b | 35 | 46.9 | 9.81 | 10 | 1.7 | 1.5 | 0.73 |
| phi-2 | 31.6 | 44.4 | 9.38 | 10 | 1.6 | 1.5 | 0.74 |
| starcoderbase-16b | 31.3 | 44 | 8.28 | 10 | 1.6 | 1.5 | 0.74 |
| starcoderbase-7b | 29.7 | 40.6 | 7.63 | 10 | 1.6 | 1.5 | 0.69 |
| deepseek-base-1.3b | 27.8 | 37.2 | 7.34 | 10 | 1.6 | 1.5 | 0.63 |
| deepseek-instruct-1.3b | 27.2 | 35 | 7.74 | 10 | 1.6 | 1.5 | 0.58 |
| phi-1.5 | 23.2 | 36 | 7.8 | 10 | 1.5 | 1.3 | 0.74 |
| phi-1 | 13.1 | 17.6 | 3.66 | 10 | 1.2 | 1.1 | 0.44 |
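
The pass@k numbers themselves can be reproduced from raw generations with the standard unbiased estimator of Chen et al. (2021); a sketch, assuming n samples per problem of which c pass:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k),
    computed as a product for numerical stability."""
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# With 10 samples per problem, 5 of which pass:
print(pass_at_k(10, 5, 1))   # 0.5
print(pass_at_k(10, 5, 10))  # 1.0
```

Averaging this estimate over problems gives a benchmark-level score; pass@count in the table corresponds to k = count.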