Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.

| model | pass1 | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| gpt-4-turbo-2024-04-09+cot | 82 | 38.7 | 3 | 1.4 | 1.2 | 0.69 |
| claude-3-opus-20240229+cot | 82 | 38.8 | 1 | 1.4 | NaN | NaN |
| gpt-4-0613+cot | 77.1 | 34.7 | 10 | 1.5 | 1.3 | 0.7 |
| gpt-4o+cot | 76 | 37.2 | 3 | 1.5 | NaN | NaN |
| gpt-4o | 70 | 28.9 | 3 | 1.6 | 1.6 | 0.35 |
| gpt-4-0613 | 68.7 | 27.7 | 10 | 1.6 | 1.6 | 0.34 |
| gpt-4-turbo-2024-04-09 | 67.7 | 26.9 | 3 | 1.7 | 1.6 | 0.37 |
| claude-3-opus-20240229 | 65.8 | 25.5 | 1 | 1.7 | NaN | NaN |
| gpt-3.5-turbo-0613+cot | 59 | 22 | 10 | 1.7 | 1.5 | 0.93 |
| deepseek-instruct-33b | 49.9 | 14.5 | 10 | 1.8 | 1.7 | 0.49 |
| gpt-3.5-turbo-0613 | 49.4 | 14.7 | 10 | 1.8 | 1.7 | 0.42 |
| deepseek-base-33b | 48.6 | 13.4 | 10 | 1.8 | 1.7 | 0.54 |
| codetulu-2-34b | 45.8 | 11.9 | 10 | 1.8 | 1.7 | 0.52 |
| magicoder-ds-7b | 44.4 | 11.4 | 10 | 1.8 | 1.7 | 0.54 |
| codellama-34b+cot | 43.6 | 13.3 | 10 | 1.8 | 1.4 | 1.1 |
| deepseek-base-6.7b | 43.5 | 10.9 | 10 | 1.8 | 1.6 | 0.6 |
| wizard-34b | 43.4 | 10.8 | 10 | 1.8 | 1.7 | 0.47 |
| codellama-34b | 42.4 | 10.2 | 10 | 1.7 | 1.6 | 0.59 |
| codellama-python-34b | 41.4 | 9.75 | 10 | 1.7 | 1.7 | 0.51 |
| wizard-13b | 41.3 | 9.73 | 10 | 1.7 | 1.7 | 0.53 |
| deepseek-instruct-6.7b | 41.2 | 9.74 | 10 | 1.7 | 1.7 | 0.43 |
| mixtral-8x7b | 40.5 | 9.27 | 10 | 1.7 | 1.6 | 0.62 |
| codellama-python-13b | 39.8 | 8.98 | 10 | 1.7 | 1.6 | 0.54 |
| codellama-13b | 39.7 | 8.98 | 10 | 1.7 | 1.6 | 0.59 |
| phind | 39.7 | 8.96 | 10 | 1.7 | 1.7 | 0.48 |
| codellama-13b+cot | 36 | 10.1 | 10 | 1.7 | 1.3 | 1.1 |
| codellama-python-7b | 35.9 | 8.02 | 10 | 1.7 | 1.6 | 0.56 |
| mistral-7b | 34.3 | 7.07 | 10 | 1.7 | 1.6 | 0.59 |
| codellama-7b | 34.2 | 6.81 | 10 | 1.7 | 1.6 | 0.6 |
| starcoderbase-16b | 34.2 | 7.2 | 10 | 1.7 | 1.6 | 0.57 |
| phi-2 | 33.5 | 7.35 | 10 | 1.7 | 1.6 | 0.58 |
| starcoderbase-7b | 32.2 | 6.28 | 10 | 1.7 | 1.6 | 0.5 |
| deepseek-base-1.3b | 31 | 6.66 | 10 | 1.6 | 1.5 | 0.6 |
| codellama-7b+cot | 29.9 | 7.62 | 10 | 1.6 | 1.2 | 1.1 |
| deepseek-instruct-1.3b | 28.7 | 6 | 10 | 1.6 | 1.5 | 0.51 |
| phi-1.5 | 27.5 | 6.77 | 10 | 1.6 | 1.5 | 0.59 |
| phi-1 | 21.7 | 4.99 | 10 | 1.5 | 1.4 | 0.48 |
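
The three error columns appear to be related: in the rows with complete data, `SE(A)` is close to the root-sum-of-squares of `SE_x(A)` and `SE_pred(A)`, as a total-variance decomposition would give. The source does not state this relationship, so treat it as an assumption inferred from the numbers. The following is a minimal sketch that checks it on a few rows copied from the table; the helper name `combined_se` and the tolerance are hypothetical choices made for illustration.

```python
import math

# (model, SE(A), SE_x(A), SE_pred(A)), copied from rows of the table above.
rows = [
    ("gpt-4-turbo-2024-04-09+cot", 1.4, 1.2, 0.69),
    ("gpt-4-0613+cot", 1.5, 1.3, 0.7),
    ("gpt-3.5-turbo-0613+cot", 1.7, 1.5, 0.93),
    ("codellama-34b+cot", 1.8, 1.4, 1.1),
    ("phi-1", 1.5, 1.4, 0.48),
]


def combined_se(se_x: float, se_pred: float) -> float:
    """Root-sum-of-squares of the two components.

    Assumption (not stated in the source): the components are
    independent, so their variances add.
    """
    return math.hypot(se_x, se_pred)


# Table entries are rounded to about two significant figures, so allow
# a loose tolerance when comparing against the reported SE(A).
TOLERANCE = 0.1

for model, se_a, se_x, se_pred in rows:
    total = combined_se(se_x, se_pred)
    status = "consistent" if abs(total - se_a) <= TOLERANCE else "differs"
    print(f"{model}: reported {se_a}, combined {total:.2f} ({status})")
```

Rows with `NaN` in `SE_x(A)` or `SE_pred(A)` are left out of the check, since the decomposition cannot be evaluated for them.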