CRUXEval-output-T0.8: by models

SE predicted by accuracy

Typical standard errors for comparisons between pairs of models on this dataset, as a function of absolute accuracy.

model	pass1	win_rate	count	SE(A)	SE_x(A)	SE_pred(A)
gpt-4-0613+cot	77.2	42	10	1.5	1.2	0.86
gpt-4-0613	68	33.8	10	1.6	1.6	0.53
gpt-3.5-turbo-0613+cot	56.5	25.7	10	1.8	1.3	1.1
deepseek-instruct-33b	48.4	18.3	10	1.8	1.5	0.88
gpt-3.5-turbo-0613	47.5	18	10	1.8	1.6	0.82
deepseek-base-33b	45.5	16.1	10	1.8	1.5	0.96
codetulu-2-34b	43.8	14.9	10	1.8	1.5	0.92
codellama-34b+cot	42.8	16.5	10	1.7	1.2	1.3
magicoder-ds-7b	41.7	13.8	10	1.7	1.5	0.94
wizard-34b	41.4	13.3	10	1.7	1.5	0.84
deepseek-instruct-6.7b	40.5	12.9	10	1.7	1.5	0.83
codellama-python-34b	39.7	12.3	10	1.7	1.5	0.87
codellama-34b	39.3	12.3	10	1.7	1.4	0.97
phind	38.9	12.2	10	1.7	1.5	0.89
deepseek-base-6.7b	38.3	11.8	10	1.7	1.4	0.97
wizard-13b	36.9	10.9	10	1.7	1.4	0.93
codellama-python-13b	36.4	10.7	10	1.7	1.4	0.96
mixtral-8x7b	36.3	10.8	10	1.7	1.4	0.99
codellama-13b	36.1	10.5	10	1.7	1.4	0.98
codellama-13b+cot	34.9	12.1	10	1.7	1.1	1.3
codellama-python-7b	32.4	8.91	10	1.7	1.4	0.95
codellama-7b	30.9	8.22	10	1.6	1.3	0.98
starcoderbase-16b	30.7	8.17	10	1.6	1.3	0.94
mistral-7b	30.1	8.19	10	1.6	1.3	1
phi-2	29.7	8.18	10	1.6	1.3	0.96
codellama-7b+cot	29.1	8.99	10	1.6	1.1	1.2
starcoderbase-7b	28.9	7.3	10	1.6	1.3	0.93
deepseek-instruct-1.3b	27.4	7.41	10	1.6	1.3	0.83
deepseek-base-1.3b	25.9	6.94	10	1.5	1.2	0.99
phi-1.5	21.7	6.39	10	1.5	1.1	0.96
phi-1	19.3	5.43	10	1.4	1.1	0.82
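
As a rough cross-check on the SE(A) column above, the sketch below computes the plain binomial standard error of an accuracy estimate, sqrt(p(1-p)/N), in percentage points. The dataset size N = 800 is an assumption (the table does not state it), and the helper name binomial_se is hypothetical. This is not necessarily how the SE(A) values were produced. It only shows that a binomial model gives similar numbers for the rows sampled here.

```python
import math

# Assumption: the benchmark has 800 examples; the table itself does not state N.
N_EXAMPLES = 800


def binomial_se(pass1_percent: float, n: int = N_EXAMPLES) -> float:
    """Binomial standard error of an accuracy, in percentage points.

    pass1_percent is the accuracy in percent, as in the pass1 column.
    """
    p = pass1_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# (model, pass1, reported SE(A)) copied from three rows of the table.
rows = [
    ("gpt-4-0613+cot", 77.2, 1.5),
    ("gpt-4-0613", 68.0, 1.6),
    ("phi-1", 19.3, 1.4),
]

for model, pass1, reported in rows:
    print(f"{model}: binomial SE = {binomial_se(pass1):.2f}, reported SE(A) = {reported}")
```

Under this assumption the binomial value peaks near 50% accuracy and shrinks toward either extreme, which matches the overall shape of the SE(A) column: about 1.8 in the middle of the table and 1.4 to 1.5 at the top and bottom.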