lcb_codegen_v5: by models

SE predicted by accuracy

Typical standard error of the difference between pairs of models on this dataset, as a function of absolute accuracy.

model	pass1	win_rate	count	SE(A)	SE_x(A)	SE_pred(A)
Kimi-k1.6-IOI-high	86	27.8	1	1.2	NaN	NaN
O1-2024-12-17 (High)	83.2	26	1	1.3	NaN	NaN
Kimi-k1.6-IOI	80.1	23.4	10	1.3	1.2	0.69
QwQ-Max-Preview	80	23.2	1	1.3	NaN	NaN
O1-2024-12-17 (Med)	78.3	22.1	1	1.4	NaN	NaN
DeepSeek-R1-Preview	77.9	21.5	10	1.4	1.2	0.69
Llama-3_1-Nemotron-Ultra-253B-v1	77.7	22.3	1	1.4	NaN	NaN
O1-2024-12-17 (Low)	75.9	20.7	1	1.4	NaN	NaN
DeepCoder-14B-Preview	73.3	20.6	1	1.5	NaN	NaN
O1-Mini-2024-09-12	68.4	16.7	1	1.6	NaN	NaN
Llama-3_1-Nemotron-Nano-8B-v1	64.4	15	1	1.6	NaN	NaN
DeepSeek-R1-Lite-Preview	63.1	13.3	10	1.6	1.4	0.89
QwQ-32B-Preview	59.9	11.5	1	1.7	NaN	NaN
O1-Preview-2024-09-12	55.6	10.5	1	1.7	NaN	NaN
DeepSeek-V3 copy	54.5	9.84	10	1.7	1.6	0.61
MetaStone-L1-7B	54.1	10	10	1.7	1.4	0.96
Gemini-Flash-2.0-Thinking	51.1	8.08	10	1.7	1.5	0.77
Claude-3.5-Sonnet-20240620	48	7.45	10	1.7	1.6	0.4
GPT-4O-2024-05-13	43.4	5.25	10	1.7	1.6	0.53
Gemini-Pro-1.5-002	42.1	4.95	10	1.7	1.6	0.47
Mistral-Large	37.3	3.66	10	1.6	1.5	0.62
Gemini-Flash-1.5-002	36.1	3.36	10	1.6	1.6	0.34
Codestral-Latest	35.3	3.31	10	1.6	1.5	0.53
AzeroGPT-64b	22	1.15	10	1.4	1.1	0.86
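
The SE(A) column above is largest near 50% accuracy and shrinks toward the extremes, which is the shape a binomial standard error would give. The report does not state how any of the SE columns were computed, so the following is only a minimal sketch of that relationship, assuming independent questions. The function names and the question count `N_QUESTIONS` are hypothetical and are not taken from the report.

```python
import math


def binomial_se(accuracy_pct: float, n_questions: int) -> float:
    """Standard error, in percentage points, of an accuracy measured on
    n_questions independent questions (binomial assumption)."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


def se_of_difference(se_a: float, se_b: float) -> float:
    """Standard error of the accuracy difference A - B, assuming the two
    estimates are independent. This ignores the pairing on shared questions."""
    return math.hypot(se_a, se_b)


# Hypothetical question count, chosen only for illustration.
N_QUESTIONS = 1000

for acc in (22.0, 50.0, 86.0):
    se = binomial_se(acc, N_QUESTIONS)
    print(f"accuracy={acc:.1f}%  SE={se:.2f} points")

# Unpaired SE for the gap between two models at 80% and 78% accuracy.
gap_se = se_of_difference(binomial_se(80.0, N_QUESTIONS),
                          binomial_se(78.0, N_QUESTIONS))
print(f"SE of difference (independent assumption): {gap_se:.2f} points")
```

The independence assumption in `se_of_difference` is a simplification: models evaluated on the same questions are usually positively correlated, in which case a paired estimate of the difference would be smaller than this value.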