model pass@1 pass@count win_rate count SE(A) SE_x(A) SE_pred(A)
Kimi-k1.6-IOI-high 86 86 27.8 1 1.2 NaN NaN
O1-2024-12-17 (High) 83.2 83.2 26 1 1.3 NaN NaN
Kimi-k1.6-IOI 80.1 88.5 23.4 10 1.3 1.2 0.69
QwQ-Max-Preview 80 80 23.2 1 1.3 NaN NaN
O1-2024-12-17 (Med) 78.3 78.3 22.1 1 1.4 NaN NaN
DeepSeek-R1-Preview 77.9 87 21.5 10 1.4 1.2 0.69
Llama-3_1-Nemotron-Ultra-253B-v1 77.7 77.7 22.3 1 1.4 NaN NaN
O1-2024-12-17 (Low) 75.9 75.9 20.7 1 1.4 NaN NaN
DeepCoder-14B-Preview 73.3 73.3 20.6 1 1.5 NaN NaN
O1-Mini-2024-09-12 68.4 68.4 16.7 1 1.6 NaN NaN
Llama-3_1-Nemotron-Nano-8B-v1 64.4 64.4 15 1 1.6 NaN NaN
DeepSeek-R1-Lite-Preview 63.1 79.5 13.3 10 1.6 1.4 0.89
QwQ-32B-Preview 59.9 59.9 11.5 1 1.7 NaN NaN
O1-Preview-2024-09-12 55.6 55.6 10.5 1 1.7 NaN NaN
DeepSeek-V3 copy 54.5 63.6 9.84 10 1.7 1.6 0.61
MetaStone-L1-7B 54.1 74.3 10 10 1.7 1.4 0.96
Gemini-Flash-2.0-Thinking 51.1 64.1 8.08 10 1.7 1.5 0.77
Claude-3.5-Sonnet-20240620 48 51.4 7.45 10 1.7 1.6 0.4
GPT-4O-2024-05-13 43.4 50.7 5.25 10 1.7 1.6 0.53
Gemini-Pro-1.5-002 42.1 47.7 4.95 10 1.7 1.6 0.47
Mistral-Large 37.3 47.4 3.66 10 1.6 1.5 0.62
Gemini-Flash-1.5-002 36.1 39.2 3.36 10 1.6 1.6 0.34
Codestral-Latest 35.3 42.2 3.31 10 1.6 1.5 0.53
AzeroGPT-64b 22 40.7 1.15 10 1.4 1.1 0.86