The typical standard errors between pairs of models on this dataset as a function of the absolute accuracy.
| model | pass@1 (%) | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| Kimi-k1.6-IOI-high | 86 | 27.8 | 1 | 1.2 | NaN | NaN |
| O1-2024-12-17 (High) | 83.2 | 26 | 1 | 1.3 | NaN | NaN |
| Kimi-k1.6-IOI | 80.1 | 23.4 | 10 | 1.3 | 1.2 | 0.69 |
| QwQ-Max-Preview | 80 | 23.2 | 1 | 1.3 | NaN | NaN |
| O1-2024-12-17 (Med) | 78.3 | 22.1 | 1 | 1.4 | NaN | NaN |
| DeepSeek-R1-Preview | 77.9 | 21.5 | 10 | 1.4 | 1.2 | 0.69 |
| Llama-3_1-Nemotron-Ultra-253B-v1 | 77.7 | 22.3 | 1 | 1.4 | NaN | NaN |
| O1-2024-12-17 (Low) | 75.9 | 20.7 | 1 | 1.4 | NaN | NaN |
| DeepCoder-14B-Preview | 73.3 | 20.6 | 1 | 1.5 | NaN | NaN |
| O1-Mini-2024-09-12 | 68.4 | 16.7 | 1 | 1.6 | NaN | NaN |
| Llama-3_1-Nemotron-Nano-8B-v1 | 64.4 | 15 | 1 | 1.6 | NaN | NaN |
| DeepSeek-R1-Lite-Preview | 63.1 | 13.3 | 10 | 1.6 | 1.4 | 0.89 |
| QwQ-32B-Preview | 59.9 | 11.5 | 1 | 1.7 | NaN | NaN |
| O1-Preview-2024-09-12 | 55.6 | 10.5 | 1 | 1.7 | NaN | NaN |
| DeepSeek-V3 copy | 54.5 | 9.84 | 10 | 1.7 | 1.6 | 0.61 |
| MetaStone-L1-7B | 54.1 | 10 | 10 | 1.7 | 1.4 | 0.96 |
| Gemini-Flash-2.0-Thinking | 51.1 | 8.08 | 10 | 1.7 | 1.5 | 0.77 |
| Claude-3.5-Sonnet-20240620 | 48 | 7.45 | 10 | 1.7 | 1.6 | 0.4 |
| GPT-4O-2024-05-13 | 43.4 | 5.25 | 10 | 1.7 | 1.6 | 0.53 |
| Gemini-Pro-1.5-002 | 42.1 | 4.95 | 10 | 1.7 | 1.6 | 0.47 |
| Mistral-Large | 37.3 | 3.66 | 10 | 1.6 | 1.5 | 0.62 |
| Gemini-Flash-1.5-002 | 36.1 | 3.36 | 10 | 1.6 | 1.6 | 0.34 |
| Codestral-Latest | 35.3 | 3.31 | 10 | 1.6 | 1.5 | 0.53 |
| AzeroGPT-64b | 22 | 1.15 | 10 | 1.4 | 1.1 | 0.86 |
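
The table does not say how the three standard-error columns relate. One plausible reading, stated here as an assumption rather than the source's definition, is a law-of-total-variance split, SE(A)² ≈ SE_x(A)² + SE_pred(A)², where SE_x(A) comes from the choice of problems and SE_pred(A) from sampling noise in the model's predictions. The sketch below checks that reading against a few rows copied from the table (only rows with count = 10, since rows with count = 1 report NaN for the split columns). The helper name `combined_se` is hypothetical, and because the table values are rounded to two significant figures, agreement is only expected to within about 0.1.

```python
import math

# (model, SE(A), SE_x(A), SE_pred(A)) copied from table rows with count = 10.
rows = [
    ("Kimi-k1.6-IOI", 1.3, 1.2, 0.69),
    ("DeepSeek-R1-Lite-Preview", 1.6, 1.4, 0.89),
    ("MetaStone-L1-7B", 1.7, 1.4, 0.96),
    ("Claude-3.5-Sonnet-20240620", 1.7, 1.6, 0.4),
    ("AzeroGPT-64b", 1.4, 1.1, 0.86),
]


def combined_se(se_x: float, se_pred: float) -> float:
    """Combine two independent error components in quadrature.

    Assumption: the components are uncorrelated, so their variances add.
    """
    return math.sqrt(se_x ** 2 + se_pred ** 2)


# Gap between the recombined value and the reported SE(A), per model.
gaps = {name: abs(combined_se(se_x, se_pred) - se) for name, se, se_x, se_pred in rows}

for name, se, se_x, se_pred in rows:
    print(f"{name}: combined={combined_se(se_x, se_pred):.2f}, reported SE(A)={se}")
```

If the assumption holds, each recombined value should match the reported SE(A) up to rounding. A consistent mismatch would suggest the columns are defined differently from what is assumed here.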