CRUXEval-input-T0.8: by models

SE predicted by accuracy

This section shows the typical standard errors between pairs of models on this dataset as a function of absolute accuracy.

model	pass1	win_rate	count	SE(A)	SE_x(A)	SE_pred(A)
gpt-4-0613+cot	73.7	42.8	10	1.6	1.2	1
gpt-4-0613	68	37.6	10	1.6	1.5	0.75
gpt-3.5-turbo-0613	45.7	21	10	1.8	1.4	1
phind	44.4	19.8	10	1.8	1.4	1.1
gpt-3.5-turbo-0613+cot	44.3	21.8	10	1.8	1.1	1.4
codetulu-2-34b	43.9	19.4	10	1.8	1.3	1.2
deepseek-instruct-33b	42.8	18.9	10	1.7	1.4	1.1
codellama-34b+cot	42.7	20.2	10	1.7	1.1	1.4
codellama-34b	41.1	17.7	10	1.7	1.3	1.2
magicoder-ds-7b	40.1	17.1	10	1.7	1.3	1.1
deepseek-base-33b	39.6	17.1	10	1.7	1.3	1.2
wizard-34b	38.7	16.1	10	1.7	1.4	1
codellama-python-34b	37	15.1	10	1.7	1.3	1.1
deepseek-base-6.7b	36.9	15.5	10	1.7	1.2	1.2
codellama-13b+cot	36.4	16.4	10	1.7	1	1.4
codellama-13b	35.2	14.3	10	1.7	1.2	1.2
deepseek-instruct-6.7b	34.7	14	10	1.7	1.3	1
mixtral-8x7b	32.8	13.2	10	1.7	1.2	1.2
codellama-python-13b	32.5	12.8	10	1.7	1.2	1.2
wizard-13b	32.2	12.5	10	1.7	1.3	1
codellama-python-7b	31.6	12.6	10	1.6	1.1	1.2
codellama-7b+cot	30	12.9	10	1.6	0.92	1.3
codellama-7b	28.4	10.6	10	1.6	1.1	1.1
mistral-7b	27.6	10.3	10	1.6	1.1	1.1
starcoderbase-16b	25.8	9.51	10	1.5	1.1	1.1
phi-2	25.7	10.4	10	1.5	1	1.1
starcoderbase-7b	25.4	9.18	10	1.5	1.1	1.1
deepseek-instruct-1.3b	24	9.07	10	1.5	1.2	0.92
deepseek-base-1.3b	22.5	8.25	10	1.5	1	1.1
phi-1.5	16.1	6.37	10	1.3	0.8	1
phi-1	12.6	4.03	10	1.2	0.96	0.67
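
As a rough sanity check on the table, the sketch below recomputes SE(A) from pass1 under a plain binomial model. It rests on two assumptions that the page does not state: that SE(A) is the binomial standard error of the accuracy, and that the benchmark has 800 examples (the size of CRUXEval). The helper names are hypothetical. The second helper reflects an observation from the numbers, not a documented definition: SE(A) looks close to the quadrature sum of SE_x(A) and SE_pred(A).

```python
import math

N_EXAMPLES = 800  # assumption: CRUXEval size; not stated in the table


def binomial_se(pass1_pct: float, n_examples: int = N_EXAMPLES) -> float:
    """Binomial standard error of an accuracy, in percentage points."""
    p = pass1_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_examples)


def quadrature_sum(se_x: float, se_pred: float) -> float:
    """Combine two independent error components: sqrt(a^2 + b^2).

    Assumption: SE_x(A) and SE_pred(A) are the two components of SE(A).
    """
    return math.hypot(se_x, se_pred)


# (pass1, SE(A), SE_x(A), SE_pred(A)) copied from three rows of the table.
rows = {
    "gpt-4-0613+cot": (73.7, 1.6, 1.2, 1.0),
    "gpt-3.5-turbo-0613": (45.7, 1.8, 1.4, 1.0),
    "phi-1": (12.6, 1.2, 0.96, 0.67),
}

for model, (pass1, se_a, se_x, se_pred) in rows.items():
    print(
        f"{model}: table SE(A)={se_a}, "
        f"binomial={binomial_se(pass1):.2f}, "
        f"quadrature={quadrature_sum(se_x, se_pred):.2f}"
    )
```

For these three rows the binomial estimate rounds to the tabulated SE(A), and the quadrature sum lands within about 0.1 of it, which is about what rounding to two significant figures allows.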