Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy. All accuracies and standard errors are in percentage points.
| model | pass@1 | pass@count | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|---|
| gpt-4-turbo-2024-04-09+cot | 75.7 | 85.5 | 36.1 | 2.7 | 1.5 | NaN | NaN |
| gpt-4o+cot | 75.6 | 83.6 | 35.5 | 2.7 | 1.5 | NaN | NaN |
| gpt-4-0613+cot | 75.5 | 91.6 | 35.4 | 10 | 1.5 | 1.2 | 0.93 |
| claude-3-opus-20240229+cot | 73.4 | 73.4 | 34.4 | 1 | 1.6 | NaN | NaN |
| gpt-4-0613 | 69.8 | 74.2 | 30.7 | 10 | 1.6 | 1.6 | 0.45 |
| gpt-4-turbo-2024-04-09 | 68.5 | 71.9 | 29.5 | 3 | 1.6 | 1.6 | 0.53 |
| gpt-4o | 65.1 | 68.1 | 27.4 | 3 | 1.7 | 1.6 | 0.52 |
| claude-3-opus-20240229 | 64.2 | 64.2 | 27.6 | 1 | 1.7 | NaN | NaN |
| gpt-3.5-turbo-0613+cot | 50.3 | 77.8 | 19.7 | 10 | 1.8 | 1.3 | 1.2 |
| codellama-34b+cot | 50.1 | 70.5 | 18.8 | 10 | 1.8 | 1.5 | 0.98 |
| codetulu-2-34b | 49.2 | 60.5 | 16.5 | 10 | 1.8 | 1.6 | 0.73 |
| gpt-3.5-turbo-0613 | 49 | 57 | 17.4 | 10 | 1.8 | 1.7 | 0.58 |
| codellama-13b+cot | 47.4 | 65.8 | 16.9 | 10 | 1.8 | 1.5 | 0.89 |
| codellama-34b | 47.2 | 59.2 | 15.3 | 10 | 1.8 | 1.6 | 0.74 |
| phind | 47.2 | 56.1 | 16 | 10 | 1.8 | 1.6 | 0.65 |
| deepseek-base-33b | 46.5 | 58 | 14.9 | 10 | 1.8 | 1.6 | 0.75 |
| deepseek-instruct-33b | 46.5 | 57 | 15.5 | 10 | 1.8 | 1.6 | 0.69 |
| codellama-python-34b | 43.9 | 56.4 | 13.9 | 10 | 1.8 | 1.6 | 0.74 |
| wizard-34b | 42.7 | 51.5 | 13.7 | 10 | 1.7 | 1.6 | 0.63 |
| codellama-13b | 42.5 | 54.9 | 12.9 | 10 | 1.7 | 1.6 | 0.8 |
| deepseek-base-6.7b | 41.9 | 54.4 | 12.8 | 10 | 1.7 | 1.6 | 0.74 |
| magicoder-ds-7b | 41.7 | 51.1 | 12.6 | 10 | 1.7 | 1.6 | 0.66 |
| codellama-7b+cot | 40.4 | 62 | 14.1 | 10 | 1.7 | 1.4 | 1 |
| codellama-python-13b | 39.7 | 53.1 | 11.6 | 10 | 1.7 | 1.5 | 0.79 |
| mixtral-8x7b | 39.3 | 53.1 | 12 | 10 | 1.7 | 1.5 | 0.79 |
| deepseek-instruct-6.7b | 37.4 | 46 | 10.8 | 10 | 1.7 | 1.6 | 0.64 |
| codellama-python-7b | 37.3 | 48.4 | 10.9 | 10 | 1.7 | 1.6 | 0.69 |
| wizard-13b | 36.5 | 45.9 | 10.4 | 10 | 1.7 | 1.6 | 0.63 |
| codellama-7b | 36 | 47.5 | 10.1 | 10 | 1.7 | 1.5 | 0.72 |
| mistral-7b | 35 | 46.9 | 9.81 | 10 | 1.7 | 1.5 | 0.73 |
| phi-2 | 31.6 | 44.4 | 9.38 | 10 | 1.6 | 1.5 | 0.74 |
| starcoderbase-16b | 31.3 | 44 | 8.28 | 10 | 1.6 | 1.5 | 0.74 |
| starcoderbase-7b | 29.7 | 40.6 | 7.63 | 10 | 1.6 | 1.5 | 0.69 |
| deepseek-base-1.3b | 27.8 | 37.2 | 7.34 | 10 | 1.6 | 1.5 | 0.63 |
| deepseek-instruct-1.3b | 27.2 | 35 | 7.74 | 10 | 1.6 | 1.5 | 0.58 |
| phi-1.5 | 23.2 | 36 | 7.8 | 10 | 1.5 | 1.3 | 0.74 |
| phi-1 | 13.1 | 17.6 | 3.66 | 10 | 1.2 | 1.1 | 0.44 |
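The SE(A) column is numerically consistent with the standard error of a binomial proportion over the benchmark's problems. A minimal sketch, assuming a benchmark of n = 800 problems (an assumption inferred from the SE(A) values, not stated in the table):

```python
import math

def binomial_se(accuracy_pct: float, n: int = 800) -> float:
    """Standard error of a mean-accuracy estimate, in percentage points.

    accuracy_pct: observed accuracy as a percentage (e.g. 69.8).
    n: number of problems in the benchmark (assumed 800 here).
    """
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# e.g. gpt-4-0613 at 69.8 pass@1
print(round(binomial_se(69.8), 1))
```

Under this assumption the formula reproduces the SE(A) column: 69.8% accuracy gives 1.6, 50.3% gives 1.8, and 13.1% gives 1.2, matching the table to the precision shown. The SE_x(A) and SE_pred(A) columns presumably decompose the variance further using the multiple samples per problem (`count`), which this sketch does not model.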