Scores are in percent and rows are sorted by pass@1 (descending). `count` is the number of generations sampled per problem, and `pass@count` is pass@k evaluated at k = count, so it equals pass@1 when count = 1. NaN marks standard-error components that could not be estimated from the available samples.

| model                      | pass@1 | pass@count | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|----------------------------|-------:|-----------:|---------:|------:|------:|--------:|-----------:|
| gpt-4-turbo-2024-04-09+cot | 82     | 87.6       | 38.7     | 3     | 1.4   | 1.2     | 0.69       |
| claude-3-opus-20240229+cot | 82     | 82         | 38.8     | 1     | 1.4   | NaN     | NaN        |
| gpt-4-0613+cot             | 77.1   | 86.5       | 34.7     | 10    | 1.5   | 1.3     | 0.7        |
| gpt-4o+cot                 | 76     | 91         | 37.2     | 3     | 1.5   | NaN     | NaN        |
| gpt-4o                     | 70     | 71.4       | 28.9     | 3     | 1.6   | 1.6     | 0.35       |
| gpt-4-0613                 | 68.7   | 71.1       | 27.7     | 10    | 1.6   | 1.6     | 0.34       |
| gpt-4-turbo-2024-04-09     | 67.7   | 69.4       | 26.9     | 3     | 1.7   | 1.6     | 0.37       |
| claude-3-opus-20240229     | 65.8   | 65.8       | 25.5     | 1     | 1.7   | NaN     | NaN        |
| gpt-3.5-turbo-0613+cot     | 59     | 76.2       | 22       | 10    | 1.7   | 1.5     | 0.93       |
| deepseek-instruct-33b      | 49.9   | 54.9       | 14.5     | 10    | 1.8   | 1.7     | 0.49       |
| gpt-3.5-turbo-0613         | 49.4   | 53.9       | 14.7     | 10    | 1.8   | 1.7     | 0.42       |
| deepseek-base-33b          | 48.6   | 54.8       | 13.4     | 10    | 1.8   | 1.7     | 0.54       |
| codetulu-2-34b             | 45.8   | 52.4       | 11.9     | 10    | 1.8   | 1.7     | 0.52       |
| magicoder-ds-7b            | 44.4   | 50.6       | 11.4     | 10    | 1.8   | 1.7     | 0.54       |
| codellama-34b+cot          | 43.6   | 69.8       | 13.3     | 10    | 1.8   | 1.4     | 1.1        |
| deepseek-base-6.7b         | 43.5   | 51.2       | 10.9     | 10    | 1.8   | 1.6     | 0.6        |
| wizard-34b                 | 43.4   | 47.5       | 10.8     | 10    | 1.8   | 1.7     | 0.47       |
| codellama-34b              | 42.4   | 50         | 10.2     | 10    | 1.7   | 1.6     | 0.59       |
| codellama-python-34b       | 41.4   | 46.9       | 9.75     | 10    | 1.7   | 1.7     | 0.51       |
| wizard-13b                 | 41.3   | 47         | 9.73     | 10    | 1.7   | 1.7     | 0.53       |
| deepseek-instruct-6.7b     | 41.2   | 45.1       | 9.74     | 10    | 1.7   | 1.7     | 0.43       |
| mixtral-8x7b               | 40.5   | 49.5       | 9.27     | 10    | 1.7   | 1.6     | 0.62       |
| codellama-python-13b       | 39.8   | 46.4       | 8.98     | 10    | 1.7   | 1.6     | 0.54       |
| codellama-13b              | 39.7   | 48.1       | 8.98     | 10    | 1.7   | 1.6     | 0.59       |
| phind                      | 39.7   | 44.8       | 8.96     | 10    | 1.7   | 1.7     | 0.48       |
| codellama-13b+cot          | 36     | 64.9       | 10.1     | 10    | 1.7   | 1.3     | 1.1        |
| codellama-python-7b        | 35.9   | 42.9       | 8.02     | 10    | 1.7   | 1.6     | 0.56       |
| mistral-7b                 | 34.3   | 42.1       | 7.07     | 10    | 1.7   | 1.6     | 0.59       |
| codellama-7b               | 34.2   | 41.9       | 6.81     | 10    | 1.7   | 1.6     | 0.6        |
| starcoderbase-16b          | 34.2   | 41.5       | 7.2      | 10    | 1.7   | 1.6     | 0.57       |
| phi-2                      | 33.5   | 42.6       | 7.35     | 10    | 1.7   | 1.6     | 0.58       |
| starcoderbase-7b           | 32.2   | 38.2       | 6.28     | 10    | 1.7   | 1.6     | 0.5        |
| deepseek-base-1.3b         | 31     | 38.5       | 6.66     | 10    | 1.6   | 1.5     | 0.6        |
| codellama-7b+cot           | 29.9   | 58.8       | 7.62     | 10    | 1.6   | 1.2     | 1.1        |
| deepseek-instruct-1.3b     | 28.7   | 34.6       | 6        | 10    | 1.6   | 1.5     | 0.51       |
| phi-1.5                    | 27.5   | 35.2       | 6.77     | 10    | 1.6   | 1.5     | 0.59       |
| phi-1                      | 21.7   | 26.5       | 4.99     | 10    | 1.5   | 1.4     | 0.48       |
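
For reference, pass@k scores like the pass@1 and pass@count columns above are conventionally computed with the unbiased estimator of Chen et al. (2021), 1 - C(n-c, k)/C(n, k), from n sampled generations per problem of which c pass. The sketch below illustrates that estimator; the assumption that this table was produced with exactly this estimator, and the per-problem pass counts in the usage example, are both hypothetical.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator of Chen et al. (2021):
    1 - C(n - c, k) / C(n, k), where n is the number of samples drawn
    for a problem and c the number that passed its tests.
    Evaluated as a running product for numerical stability."""
    if n - c < k:
        # Too few failures left: every size-k subset contains a pass.
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Benchmark-level pass@k is the mean over problems. With count = 10
# samples per problem (pass counts here are illustrative only):
per_problem_passes = [10, 7, 0, 3, 10]
print(np.mean([pass_at_k(10, c, 1) for c in per_problem_passes]))   # pass@1
print(np.mean([pass_at_k(10, c, 10) for c in per_problem_passes]))  # pass@10
```

The product form avoids evaluating the binomial coefficients directly, which would overflow for large n; at k = 1 it reduces to the plain fraction of passing samples, c/n.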