The table below reports the typical standard errors between pairs of models on this dataset as a function of absolute accuracy; a sketch of one way these quantities can be estimated follows the table.
| Model | Pass@1 (%) | Win rate (%) | Runs | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| O4-Mini (High) | 80.2 | 27.7 | 1 | 1.9 | NaN | NaN |
| O3 (High) | 75.8 | 24.5 | 1 | 2.0 | NaN | NaN |
| O4-Mini (Medium) | 74.2 | 22.4 | 1 | 2.1 | NaN | NaN |
| Gemini-2.5-Pro-06-05 | 73.6 | 21.8 | 1 | 2.1 | NaN | NaN |
| DeepSeek-R1-0528 | 73.1 | 21.7 | 10 | 2.1 | 1.8 | 1.1 |
| Gemini-2.5-Pro-05-06 | 71.8 | 20.3 | 1 | 2.1 | NaN | NaN |
| EXAONE-4.0-32B | 70.0 | 19.2 | 4 | 2.1 | 1.8 | 1.2 |
| OpenReasoning-Nemotron-32B | 69.8 | 19.2 | 10 | 2.2 | 1.8 | 1.2 |
| Gemini-2.5-Pro-03-25 | 67.8 | 18.2 | 1 | 2.2 | NaN | NaN |
| O3-Mini-2025-01-31 (High) | 67.4 | 18.3 | 1 | 2.2 | NaN | NaN |
| Grok-3-Mini (High) | 66.7 | 21.3 | 1 | 2.2 | NaN | NaN |
| Qwen3-235B-A22B | 65.9 | 16.5 | 1 | 2.2 | NaN | NaN |
| O4-Mini (Low) | 65.9 | 16.4 | 1 | 2.2 | NaN | NaN |
| XBai-o4-medium | 65.0 | 15.9 | 1 | 2.2 | NaN | NaN |
| O3-Mini-2025-01-31 (Med) | 63.0 | 15.2 | 1 | 2.3 | NaN | NaN |
| Gemini-2.5-Flash-05-20 | 61.9 | 14.1 | 1 | 2.3 | NaN | NaN |
| Gemini-2.5-Flash-04-17 | 60.6 | 13.4 | 1 | 2.3 | NaN | NaN |
| O3-Mini-2025-01-31 (Low) | 57.0 | 11.5 | 1 | 2.3 | NaN | NaN |
| Claude-Opus-4 (Thinking) | 56.6 | 10.8 | 1 | 2.3 | NaN | NaN |
| Claude-Sonnet-4 (Thinking) | 55.9 | 10.8 | 1 | 2.3 | NaN | NaN |
| QwQ-32B_temp | 55.7 | 10.8 | 1 | 2.3 | NaN | NaN |
| Claude-3.7-Sonnet | 50.4 | 8.48 | 1 | 2.3 | NaN | NaN |
| Gemini-Flash-2.0-Thinking-01-21 | 48.9 | 8.54 | 1 | 2.3 | NaN | NaN |
| Gemini-Flash-2.0-Thinking-12-19 | 48.7 | 8.84 | 1 | 2.3 | NaN | NaN |
| Claude-Sonnet-4 | 47.1 | 7.24 | 1 | 2.3 | NaN | NaN |
| Claude-Opus-4 | 46.9 | 7.19 | 1 | 2.3 | NaN | NaN |
| Claude-3.5-Sonnet-20241022 | 36.4 | 3.84 | 10 | 2.3 | 2.2 | 0.68 |
| Gemini-Flash-2.0-Exp | 31.3 | 2.61 | 1 | 2.2 | NaN | NaN |
| GPT-4O-2024-08-06 | 29.5 | 2.02 | 10 | 2.1 | 2.0 | 0.81 |
| GPT-4-Turbo-2024-04-09 | 28.7 | 1.91 | 10 | 2.1 | 2.0 | 0.82 |
| GPT-4O-mini-2024-07-18 | 27.5 | 1.72 | 10 | 2.1 | 2.0 | 0.62 |
| DeepSeek-V3 | 27.2 | 2.83 | 10 | 2.1 | 2.0 | 0.75 |
| Claude-3-Haiku | 20.2 | 1.04 | 10 | 1.9 | 1.8 | 0.54 |
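For the rows with more than one run, the error columns satisfy SE(A)² ≈ SE_x(A)² + SE_pred(A)², which is consistent with a law-of-total-variance split of the per-attempt score variance into a between-question component (SE_x) and a within-question sampling component (SE_pred); single-run rows leave that split undefined, hence the NaN entries. The sketch below shows one plausible way to estimate all three columns from a per-question, per-run score matrix. The function name `se_columns`, the matrix layout, the synthetic data, and the exact estimators are illustrative assumptions, not the pipeline that produced the table.

```python
import numpy as np

def se_columns(scores: np.ndarray) -> dict:
    """Estimate pass@1 and its error components for one model.

    scores: (n_questions, n_runs) array of 0/1 outcomes. At least two runs
    per question are needed for the SE_x / SE_pred split, which is why the
    single-run rows in the table report NaN for those columns.
    """
    n, k = scores.shape
    p_hat = scores.mean(axis=1)        # per-question pass-rate estimates
    accuracy = p_hat.mean()            # pass@1

    # Law of total variance for a single attempt X on a random question q:
    #   Var(X) = Var_q(p_q) + E_q[p_q * (1 - p_q)]
    # "within": unbiased estimate of E_q[p_q (1 - p_q)] from k runs/question.
    within = (p_hat * (1.0 - p_hat) * k / (k - 1)).mean()
    # "between": unbiased estimate of Var_q(p_q), clipped at zero.
    between = max(p_hat.var(ddof=1) - within / k, 0.0)

    return {
        "pass1": 100.0 * accuracy,
        # Per-attempt standard error; roughly sqrt(A * (1 - A) / n), and it
        # does not shrink with extra runs (consistent with the rows above
        # that have 10 runs).
        "SE(A)": 100.0 * np.sqrt((between + within) / n),
        "SE_x(A)": 100.0 * np.sqrt(between / n),    # question-mix component
        "SE_pred(A)": 100.0 * np.sqrt(within / n),  # per-run sampling noise
    }

# Example with synthetic data: 450 questions (a placeholder count), 10 runs
# each, and heterogeneous per-question difficulties averaging roughly 70%.
rng = np.random.default_rng(0)
p_q = rng.beta(1.4, 0.6, size=450)
scores = (rng.random((450, 10)) < p_q[:, None]).astype(int)
print(se_columns(scores))
```

Under this reading, SE(A) is the uncertainty from the finite question set for a single attempt per question, SE_x(A) is the part that would remain even with unlimited runs per question, and SE_pred(A) is the part attributable to run-to-run randomness.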