CRUXEval-input-T0.2: by model



SE predicted by accuracy

Figure: typical standard errors between pairs of models on this dataset, shown as a function of absolute accuracy.
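For reference, the baseline intuition behind this curve is the binomial formula: a single model's accuracy p on N questions has standard error sqrt(p(1-p)/N). The sketch below (illustrative Python, not this site's code) evaluates it for CRUXEval's 800 problems; at 50% accuracy it gives about 1.77 percentage points, consistent with the SE(A) ≈ 1.8 entries in the table further down. The SE between a pair of models additionally depends on how correlated their errors are, so treat this only as a baseline.

```python
import math

N = 800  # CRUXEval has 800 problems

def binomial_se(p_percent: float, n: int = N) -> float:
    """SE of a single model's accuracy, in percentage points."""
    p = p_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

for p in (25.0, 50.0, 75.0):
    print(f"accuracy {p:5.1f}%  ->  SE ~ {binomial_se(p):.2f} points")
```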

CDF of question-level accuracy
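This figure shows, for each question, the fraction of sampled generations that pass, summarized as an empirical CDF. A minimal sketch of computing such a curve, assuming a hypothetical 0/1 score matrix `scores` with one row per question and one column per generation (the variable names and the random placeholder data are mine, not the site's):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=(800, 10))  # placeholder 0/1 results: 800 questions x 10 samples

q_acc = scores.mean(axis=1)                  # per-question accuracy in [0, 1]
xs = np.sort(q_acc)
cdf = np.arange(1, len(xs) + 1) / len(xs)    # empirical CDF

plt.step(xs, cdf, where="post")
plt.xlabel("question-level accuracy")
plt.ylabel("fraction of questions")
plt.title("CDF of question-level accuracy")
plt.show()
```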

Results table by model

Columns: pass1 is pass@1 accuracy (%); win_rate is the win rate (%) in per-question pairwise comparisons against the other listed models; count is the number of generations sampled per question; SE(A) is the standard error of the accuracy A, in percentage points; SE_x(A) is its question-level (x) component and SE_pred(A) the component from resampling model predictions, with SE(A)² ≈ SE_x(A)² + SE_pred(A)². SE_x(A) and SE_pred(A) are NaN where too few samples are available to separate the two sources.

model                         pass1  win_rate  count  SE(A)  SE_x(A)  SE_pred(A)
--------------------------  ------  --------  -----  -----  -------  ----------
gpt-4-turbo-2024-04-09+cot     75.7      36.1    2.7    1.5      NaN         NaN
gpt-4o+cot                     75.6      35.5    2.7    1.5      NaN         NaN
gpt-4-0613+cot                 75.5      35.4     10    1.5      1.2        0.93
claude-3-opus-20240229+cot     73.4      34.4      1    1.6      NaN         NaN
gpt-4-0613                     69.8      30.7     10    1.6      1.6        0.45
gpt-4-turbo-2024-04-09         68.5      29.5      3    1.6      1.6        0.53
gpt-4o                         65.1      27.4      3    1.7      1.6        0.52
claude-3-opus-20240229         64.2      27.6      1    1.7      NaN         NaN
gpt-3.5-turbo-0613+cot         50.3      19.7     10    1.8      1.3         1.2
codellama-34b+cot              50.1      18.8     10    1.8      1.5        0.98
codetulu-2-34b                 49.2      16.5     10    1.8      1.6        0.73
gpt-3.5-turbo-0613               49      17.4     10    1.8      1.7        0.58
codellama-13b+cot              47.4      16.9     10    1.8      1.5        0.89
codellama-34b                  47.2      15.3     10    1.8      1.6        0.74
phind                          47.2        16     10    1.8      1.6        0.65
deepseek-base-33b              46.5      14.9     10    1.8      1.6        0.75
deepseek-instruct-33b          46.5      15.5     10    1.8      1.6        0.69
codellama-python-34b           43.9      13.9     10    1.8      1.6        0.74
wizard-34b                     42.7      13.7     10    1.7      1.6        0.63
codellama-13b                  42.5      12.9     10    1.7      1.6         0.8
deepseek-base-6.7b             41.9      12.8     10    1.7      1.6        0.74
magicoder-ds-7b                41.7      12.6     10    1.7      1.6        0.66
codellama-7b+cot               40.4      14.1     10    1.7      1.4           1
codellama-python-13b           39.7      11.6     10    1.7      1.5        0.79
mixtral-8x7b                   39.3        12     10    1.7      1.5        0.79
deepseek-instruct-6.7b         37.4      10.8     10    1.7      1.6        0.64
codellama-python-7b            37.3      10.9     10    1.7      1.6        0.69
wizard-13b                     36.5      10.4     10    1.7      1.6        0.63
codellama-7b                     36      10.1     10    1.7      1.5        0.72
mistral-7b                       35      9.81     10    1.7      1.5        0.73
phi-2                          31.6      9.38     10    1.6      1.5        0.74
starcoderbase-16b              31.3      8.28     10    1.6      1.5        0.74
starcoderbase-7b               29.7      7.63     10    1.6      1.5        0.69
deepseek-base-1.3b             27.8      7.34     10    1.6      1.5        0.63
deepseek-instruct-1.3b         27.2      7.74     10    1.6      1.5        0.58
phi-1.5                        23.2       7.8     10    1.5      1.3        0.74
phi-1                          13.1      3.66     10    1.2      1.1        0.44
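Throughout the table the three error columns satisfy SE(A)² ≈ SE_x(A)² + SE_pred(A)², which is what a standard between/within (ANOVA-style) variance decomposition yields. The sketch below is one such estimator, written as an illustration under that assumption; the function name is mine and the site's actual computation may differ (e.g. a bootstrap). It expects a 0/1 score matrix with one row per question and one column per generation, and needs at least 2 samples per question, one plausible reason the count = 1 rows report NaN.

```python
import numpy as np

def se_decomposition(scores: np.ndarray) -> tuple[float, float, float]:
    """Return (SE(A), SE_x(A), SE_pred(A)) in percentage points.

    scores: 0/1 array of shape (n_questions, k_samples), k_samples >= 2.
    """
    n, k = scores.shape
    m = scores.mean(axis=1)                      # per-question accuracy
    var_m = m.var(ddof=1)                        # spread of observed question means
    se_total = np.sqrt(var_m / n)                # SE(A): total standard error of accuracy
    within = scores.var(axis=1, ddof=1).mean()   # generation-to-generation (prediction) noise
    se_pred = np.sqrt(within / (n * k))          # SE_pred(A): prediction-sampling component
    between = max(var_m - within / k, 0.0)       # question-difficulty variance, noise removed
    se_x = np.sqrt(between / n)                  # SE_x(A): question-sampling component
    return 100 * se_total, 100 * se_x, 100 * se_pred

# Demo on random placeholder data (not real model results):
rng = np.random.default_rng(0)
print(se_decomposition(rng.integers(0, 2, size=(800, 10))))
```

By construction this estimator satisfies SE(A)² = SE_x(A)² + SE_pred(A)² exactly, so it reproduces the relation visible in the table up to rounding.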