Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.

| model | pass@1 (%) | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| gpt-4-turbo-2024-04-09+cot | 75.7 | 36.1 | 2.7 | 1.5 | NaN | NaN |
| gpt-4o+cot | 75.6 | 35.5 | 2.7 | 1.5 | NaN | NaN |
| gpt-4-0613+cot | 75.5 | 35.4 | 10 | 1.5 | 1.2 | 0.93 |
| claude-3-opus-20240229+cot | 73.4 | 34.4 | 1 | 1.6 | NaN | NaN |
| gpt-4-0613 | 69.8 | 30.7 | 10 | 1.6 | 1.6 | 0.45 |
| gpt-4-turbo-2024-04-09 | 68.5 | 29.5 | 3 | 1.6 | 1.6 | 0.53 |
| gpt-4o | 65.1 | 27.4 | 3 | 1.7 | 1.6 | 0.52 |
| claude-3-opus-20240229 | 64.2 | 27.6 | 1 | 1.7 | NaN | NaN |
| gpt-3.5-turbo-0613+cot | 50.3 | 19.7 | 10 | 1.8 | 1.3 | 1.2 |
| codellama-34b+cot | 50.1 | 18.8 | 10 | 1.8 | 1.5 | 0.98 |
| codetulu-2-34b | 49.2 | 16.5 | 10 | 1.8 | 1.6 | 0.73 |
| gpt-3.5-turbo-0613 | 49 | 17.4 | 10 | 1.8 | 1.7 | 0.58 |
| codellama-13b+cot | 47.4 | 16.9 | 10 | 1.8 | 1.5 | 0.89 |
| codellama-34b | 47.2 | 15.3 | 10 | 1.8 | 1.6 | 0.74 |
| phind | 47.2 | 16 | 10 | 1.8 | 1.6 | 0.65 |
| deepseek-base-33b | 46.5 | 14.9 | 10 | 1.8 | 1.6 | 0.75 |
| deepseek-instruct-33b | 46.5 | 15.5 | 10 | 1.8 | 1.6 | 0.69 |
| codellama-python-34b | 43.9 | 13.9 | 10 | 1.8 | 1.6 | 0.74 |
| wizard-34b | 42.7 | 13.7 | 10 | 1.7 | 1.6 | 0.63 |
| codellama-13b | 42.5 | 12.9 | 10 | 1.7 | 1.6 | 0.8 |
| deepseek-base-6.7b | 41.9 | 12.8 | 10 | 1.7 | 1.6 | 0.74 |
| magicoder-ds-7b | 41.7 | 12.6 | 10 | 1.7 | 1.6 | 0.66 |
| codellama-7b+cot | 40.4 | 14.1 | 10 | 1.7 | 1.4 | 1 |
| codellama-python-13b | 39.7 | 11.6 | 10 | 1.7 | 1.5 | 0.79 |
| mixtral-8x7b | 39.3 | 12 | 10 | 1.7 | 1.5 | 0.79 |
| deepseek-instruct-6.7b | 37.4 | 10.8 | 10 | 1.7 | 1.6 | 0.64 |
| codellama-python-7b | 37.3 | 10.9 | 10 | 1.7 | 1.6 | 0.69 |
| wizard-13b | 36.5 | 10.4 | 10 | 1.7 | 1.6 | 0.63 |
| codellama-7b | 36 | 10.1 | 10 | 1.7 | 1.5 | 0.72 |
| mistral-7b | 35 | 9.81 | 10 | 1.7 | 1.5 | 0.73 |
| phi-2 | 31.6 | 9.38 | 10 | 1.6 | 1.5 | 0.74 |
| starcoderbase-16b | 31.3 | 8.28 | 10 | 1.6 | 1.5 | 0.74 |
| starcoderbase-7b | 29.7 | 7.63 | 10 | 1.6 | 1.5 | 0.69 |
| deepseek-base-1.3b | 27.8 | 7.34 | 10 | 1.6 | 1.5 | 0.63 |
| deepseek-instruct-1.3b | 27.2 | 7.74 | 10 | 1.6 | 1.5 | 0.58 |
| phi-1.5 | 23.2 | 7.8 | 10 | 1.5 | 1.3 | 0.74 |
| phi-1 | 13.1 | 3.66 | 10 | 1.2 | 1.1 | 0.44 |
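
The SE columns above can be sanity-checked with a short sketch. It rests on two assumptions that the table itself does not state: that SE(A) is the binomial standard error of an accuracy measured on `n` questions, and that SE_x(A) and SE_pred(A) are independent components that add in quadrature to give SE(A). The dataset size is not given here, so `N_ASSUMED = 800` is a hypothetical value, chosen only because it is consistent with the rounded SE(A) entries. The function names are illustrative as well.

```python
import math

# Hypothetical dataset size: not stated in the table, chosen because it
# is consistent with the rounded SE(A) column.
N_ASSUMED = 800


def binomial_se(acc_percent, n):
    """Binomial standard error of an accuracy, in percentage points."""
    p = acc_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


def combine_in_quadrature(se_x, se_pred):
    """Total SE if the two components are independent (an assumption)."""
    return math.hypot(se_x, se_pred)


# (model, pass1, SE(A), SE_x(A), SE_pred(A)) copied from three table rows.
rows = [
    ("gpt-4-0613+cot", 75.5, 1.5, 1.2, 0.93),
    ("gpt-3.5-turbo-0613+cot", 50.3, 1.8, 1.3, 1.2),
    ("phi-1", 13.1, 1.2, 1.1, 0.44),
]

for model, acc, se_table, se_x, se_pred in rows:
    se_binom = binomial_se(acc, N_ASSUMED)
    se_quad = combine_in_quadrature(se_x, se_pred)
    print(f"{model}: table={se_table} binomial={se_binom:.2f} "
          f"quadrature={se_quad:.2f}")
```

For these three rows both estimates round to the tabulated SE(A). Under the binomial assumption the error peaks near 50% accuracy and shrinks toward either extreme, which matches the pattern in the SE(A) column (about 1.8 for mid-range models, 1.2 for phi-1).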