Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.

| model | pass@1 | pass@count | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|---|
| Ante__Gemini-3-Pro-Preview | 64.7 | 82 | 26.9 | 5 | 5.1 | 4 | 3.1 |
| Mux__GPT-5.2 | 60.7 | 60.7 | 23.1 | 1 | 5.2 | NaN | NaN |
| Mux__Claude-Opus-4.5 | 58.4 | 58.4 | 21 | 1 | 5.2 | NaN | NaN |
| OpenCode__Claude-Opus-4.5 | 51.7 | 51.7 | 17.3 | 1 | 5.3 | NaN | NaN |
| MAYA__Claude-4.5-sonnet | 42.7 | 42.7 | 18.7 | 5 | 5.2 | 5.2 | 0 |
| Terminus2__GLM-4.7 | 33 | 50.6 | 8.85 | 5 | 5 | 3.9 | 3.1 |
| dakou__qwen3-coder-480b | 27.2 | 42.7 | 7.53 | 5 | 4.7 | 3.7 | 2.9 |
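
The three standard-error columns appear to be related: for every row with `count` = 5, the reported SE(A) matches SE_x(A) and SE_pred(A) combined in quadrature, up to rounding. The sketch below checks that arithmetic on the numbers copied from the table. The relation SE(A)² = SE_x(A)² + SE_pred(A)² is an assumption inferred from the values, not a definition given here, and the helper names `combine_in_quadrature` and `max_abs_gap` are hypothetical.

```python
import math

# (model, SE(A), SE_x(A), SE_pred(A)) copied from the table rows with count = 5.
# Rows with count = 1 are skipped because SE_x(A) and SE_pred(A) are NaN there.
rows = [
    ("Ante__Gemini-3-Pro-Preview", 5.1, 4.0, 3.1),
    ("MAYA__Claude-4.5-sonnet", 5.2, 5.2, 0.0),
    ("Terminus2__GLM-4.7", 5.0, 3.9, 3.1),
    ("dakou__qwen3-coder-480b", 4.7, 3.7, 2.9),
]


def combine_in_quadrature(se_x, se_pred):
    """Assumed relation (inferred from the table): SE^2 = SE_x^2 + SE_pred^2."""
    return math.hypot(se_x, se_pred)


def max_abs_gap(table_rows):
    """Largest absolute difference between reported and recombined SE(A)."""
    return max(
        abs(se - combine_in_quadrature(se_x, se_pred))
        for _, se, se_x, se_pred in table_rows
    )


for name, se, se_x, se_pred in rows:
    recombined = combine_in_quadrature(se_x, se_pred)
    print(f"{name}: reported {se:.1f}, recombined {recombined:.2f}")
```

If the assumed relation holds, the gap between reported and recombined values should stay within the rounding of the one-decimal table entries.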