lcb_codegen_v5: by models



SE predicted by accuracy

[Figure omitted: the typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.]
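The SE(A) column below is consistent with a simple binomial standard error on the pass@1 estimate, treating each question as an independent Bernoulli trial. A minimal sketch of that calculation; the dataset size `N_QUESTIONS` is an assumption inferred from the reported values, not stated on this page:

```python
import math

def binomial_se(accuracy_pct: float, n: int) -> float:
    """Standard error (in percentage points) of a pass@1 estimate over n
    questions, modeling each question as an independent Bernoulli trial."""
    p = accuracy_pct / 100.0  # the table reports accuracy in percent
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# Assumed dataset size; chosen so the result matches the SE(A) column.
N_QUESTIONS = 836
print(round(binomial_se(48.0, N_QUESTIONS), 1))  # 1.7, matching the SE(A) reported at 48% pass@1
```

The SE peaks near 50% accuracy and shrinks toward the extremes, which is why the SE(A) column is largest for mid-table models.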

CDF of question-level accuracy

[Figure omitted.]
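For models sampled multiple times (count = 10 in the table below), a question-level accuracy CDF can be built from the per-run pass/fail outcomes. A sketch under assumed data: the run matrix here is randomly generated, and both its shape and the dataset size are placeholders, since the per-question results are not shown on this page:

```python
import numpy as np

# Placeholder data: pass/fail outcomes for 10 runs over 836 questions
# (both numbers are assumptions; real per-question results are not shown here).
rng = np.random.default_rng(0)
runs = rng.integers(0, 2, size=(10, 836))

# Question-level accuracy: fraction of runs that solved each question.
q_acc = runs.mean(axis=0)

# Empirical CDF: fraction of questions with accuracy <= each threshold t.
thresholds = np.linspace(0.0, 1.0, 11)
cdf = [(q_acc <= t).mean() for t in thresholds]
```

Plotting `cdf` against `thresholds` for each model yields a curve like the figure above: a steep rise near 0 or 1 indicates questions that are consistently failed or consistently solved across runs.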

Results table by model

| Model | Pass@1 (%) | Win rate (%) | Count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| Kimi-k1.6-IOI-high | 86 | 27.8 | 1 | 1.2 | NaN | NaN |
| O1-2024-12-17 (High) | 83.2 | 26 | 1 | 1.3 | NaN | NaN |
| Kimi-k1.6-IOI | 80.1 | 23.4 | 10 | 1.3 | 1.2 | 0.69 |
| QwQ-Max-Preview | 80 | 23.2 | 1 | 1.3 | NaN | NaN |
| O1-2024-12-17 (Med) | 78.3 | 22.1 | 1 | 1.4 | NaN | NaN |
| DeepSeek-R1-Preview | 77.9 | 21.5 | 10 | 1.4 | 1.2 | 0.69 |
| Llama-3_1-Nemotron-Ultra-253B-v1 | 77.7 | 22.3 | 1 | 1.4 | NaN | NaN |
| O1-2024-12-17 (Low) | 75.9 | 20.7 | 1 | 1.4 | NaN | NaN |
| DeepCoder-14B-Preview | 73.3 | 20.6 | 1 | 1.5 | NaN | NaN |
| O1-Mini-2024-09-12 | 68.4 | 16.7 | 1 | 1.6 | NaN | NaN |
| Llama-3_1-Nemotron-Nano-8B-v1 | 64.4 | 15 | 1 | 1.6 | NaN | NaN |
| DeepSeek-R1-Lite-Preview | 63.1 | 13.3 | 10 | 1.6 | 1.4 | 0.89 |
| QwQ-32B-Preview | 59.9 | 11.5 | 1 | 1.7 | NaN | NaN |
| O1-Preview-2024-09-12 | 55.6 | 10.5 | 1 | 1.7 | NaN | NaN |
| DeepSeek-V3 copy | 54.5 | 9.84 | 10 | 1.7 | 1.6 | 0.61 |
| MetaStone-L1-7B | 54.1 | 10 | 10 | 1.7 | 1.4 | 0.96 |
| Gemini-Flash-2.0-Thinking | 51.1 | 8.08 | 10 | 1.7 | 1.5 | 0.77 |
| Claude-3.5-Sonnet-20240620 | 48 | 7.45 | 10 | 1.7 | 1.6 | 0.4 |
| GPT-4O-2024-05-13 | 43.4 | 5.25 | 10 | 1.7 | 1.6 | 0.53 |
| Gemini-Pro-1.5-002 | 42.1 | 4.95 | 10 | 1.7 | 1.6 | 0.47 |
| Mistral-Large | 37.3 | 3.66 | 10 | 1.6 | 1.5 | 0.62 |
| Gemini-Flash-1.5-002 | 36.1 | 3.36 | 10 | 1.6 | 1.6 | 0.34 |
| Codestral-Latest | 35.3 | 3.31 | 10 | 1.6 | 1.5 | 0.53 |
| AzeroGPT-64b | 22 | 1.15 | 10 | 1.4 | 1.1 | 0.86 |