Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.

| model | pass1 | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| gpt-4-0613+cot | 77.2 | 42 | 10 | 1.5 | 1.2 | 0.86 |
| gpt-4-0613 | 68 | 33.8 | 10 | 1.6 | 1.6 | 0.53 |
| gpt-3.5-turbo-0613+cot | 56.5 | 25.7 | 10 | 1.8 | 1.3 | 1.1 |
| deepseek-instruct-33b | 48.4 | 18.3 | 10 | 1.8 | 1.5 | 0.88 |
| gpt-3.5-turbo-0613 | 47.5 | 18 | 10 | 1.8 | 1.6 | 0.82 |
| deepseek-base-33b | 45.5 | 16.1 | 10 | 1.8 | 1.5 | 0.96 |
| codetulu-2-34b | 43.8 | 14.9 | 10 | 1.8 | 1.5 | 0.92 |
| codellama-34b+cot | 42.8 | 16.5 | 10 | 1.7 | 1.2 | 1.3 |
| magicoder-ds-7b | 41.7 | 13.8 | 10 | 1.7 | 1.5 | 0.94 |
| wizard-34b | 41.4 | 13.3 | 10 | 1.7 | 1.5 | 0.84 |
| deepseek-instruct-6.7b | 40.5 | 12.9 | 10 | 1.7 | 1.5 | 0.83 |
| codellama-python-34b | 39.7 | 12.3 | 10 | 1.7 | 1.5 | 0.87 |
| codellama-34b | 39.3 | 12.3 | 10 | 1.7 | 1.4 | 0.97 |
| phind | 38.9 | 12.2 | 10 | 1.7 | 1.5 | 0.89 |
| deepseek-base-6.7b | 38.3 | 11.8 | 10 | 1.7 | 1.4 | 0.97 |
| wizard-13b | 36.9 | 10.9 | 10 | 1.7 | 1.4 | 0.93 |
| codellama-python-13b | 36.4 | 10.7 | 10 | 1.7 | 1.4 | 0.96 |
| mixtral-8x7b | 36.3 | 10.8 | 10 | 1.7 | 1.4 | 0.99 |
| codellama-13b | 36.1 | 10.5 | 10 | 1.7 | 1.4 | 0.98 |
| codellama-13b+cot | 34.9 | 12.1 | 10 | 1.7 | 1.1 | 1.3 |
| codellama-python-7b | 32.4 | 8.91 | 10 | 1.7 | 1.4 | 0.95 |
| codellama-7b | 30.9 | 8.22 | 10 | 1.6 | 1.3 | 0.98 |
| starcoderbase-16b | 30.7 | 8.17 | 10 | 1.6 | 1.3 | 0.94 |
| mistral-7b | 30.1 | 8.19 | 10 | 1.6 | 1.3 | 1 |
| phi-2 | 29.7 | 8.18 | 10 | 1.6 | 1.3 | 0.96 |
| codellama-7b+cot | 29.1 | 8.99 | 10 | 1.6 | 1.1 | 1.2 |
| starcoderbase-7b | 28.9 | 7.3 | 10 | 1.6 | 1.3 | 0.93 |
| deepseek-instruct-1.3b | 27.4 | 7.41 | 10 | 1.6 | 1.3 | 0.83 |
| deepseek-base-1.3b | 25.9 | 6.94 | 10 | 1.5 | 1.2 | 0.99 |
| phi-1.5 | 21.7 | 6.39 | 10 | 1.5 | 1.1 | 0.96 |
| phi-1 | 19.3 | 5.43 | 10 | 1.4 | 1.1 | 0.82 |
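
As a rough sanity check on the table, the sketch below makes two assumptions that the source does not state: that the three standard-error columns combine in quadrature, SE(A)² ≈ SE_x(A)² + SE_pred(A)², and that SE(A) can be compared with the binomial standard error sqrt(p(1-p)/n) for accuracy p over n questions. The function names are hypothetical, and the rows are copied from the table, whose values are rounded, so only approximate agreement can be expected.

```python
import math

# (model, pass1 in %, SE(A), SE_x(A), SE_pred(A)), copied from the table above.
ROWS = [
    ("gpt-4-0613+cot", 77.2, 1.5, 1.2, 0.86),
    ("gpt-3.5-turbo-0613+cot", 56.5, 1.8, 1.3, 1.1),
    ("phi-1", 19.3, 1.4, 1.1, 0.82),
]


def combine_in_quadrature(se_x, se_pred):
    """Assumed relation: independent error sources add in variance."""
    return math.sqrt(se_x ** 2 + se_pred ** 2)


def binomial_se_percent(pass1_percent, n):
    """Binomial standard error, in percentage points, for accuracy p over n questions."""
    p = pass1_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


def implied_n(pass1_percent, se_percent):
    """Question count n that would make the binomial SE equal se_percent.

    This only inverts the assumed binomial formula; it is not the dataset size.
    """
    p = pass1_percent / 100.0
    return p * (1.0 - p) / (se_percent / 100.0) ** 2


if __name__ == "__main__":
    for model, pass1, se, se_x, se_pred in ROWS:
        combined = combine_in_quadrature(se_x, se_pred)
        n_hat = implied_n(pass1, se)
        print(f"{model}: table SE(A)={se}, quadrature={combined:.2f}, implied n={n_hat:.0f}")
```

For these three rows the quadrature value agrees with the SE(A) column to within about 0.1, which is consistent with rounding in the table but does not confirm how the columns were computed.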