model pass@1 pass@count win_rate count SE(A) SE_x(A) SE_pred(A)
O4-Mini (High) 80.2 80.2 27.7 1 1.9 NaN NaN
O3 (High) 75.8 75.8 24.5 1 2 NaN NaN
O4-Mini (Medium) 74.2 74.2 22.4 1 2.1 NaN NaN
Gemini-2.5-Pro-06-05 73.6 73.6 21.8 1 2.1 NaN NaN
DeepSeek-R1-0528 73.1 87.9 21.7 10 2.1 1.8 1.1
Gemini-2.5-Pro-05-06 71.8 71.8 20.3 1 2.1 NaN NaN
EXAONE-4.0-32B 70 80.6 19.2 4 2.1 1.8 1.2
OpenReasoning-Nemotron-32B 69.8 83 19.2 10 2.2 1.8 1.2
Gemini-2.5-Pro-03-25 67.8 67.8 18.2 1 2.2 NaN NaN
O3-Mini-2025-01-31 (High) 67.4 67.4 18.3 1 2.2 NaN NaN
Grok-3-Mini (High) 66.7 66.7 21.3 1 2.2 NaN NaN
Qwen3-235B-A22B 65.9 65.9 16.5 1 2.2 NaN NaN
O4-Mini (Low) 65.9 65.9 16.4 1 2.2 NaN NaN
XBai-o4-medium 65 65 15.9 1 2.2 NaN NaN
O3-Mini-2025-01-31 (Med) 63 63 15.2 1 2.3 NaN NaN
Gemini-2.5-Flash-05-20 61.9 61.9 14.1 1 2.3 NaN NaN
Gemini-2.5-Flash-04-17 60.6 60.6 13.4 1 2.3 NaN NaN
O3-Mini-2025-01-31 (Low) 57 57 11.5 1 2.3 NaN NaN
Claude-Opus-4 (Thinking) 56.6 56.6 10.8 1 2.3 NaN NaN
Claude-Sonnet-4 (Thinking) 55.9 55.9 10.8 1 2.3 NaN NaN
QwQ-32B_temp 55.7 55.7 10.8 1 2.3 NaN NaN
Claude-3.7-Sonnet 50.4 50.4 8.48 1 2.3 NaN NaN
Gemini-Flash-2.0-Thinking-01-21 48.9 48.9 8.54 1 2.3 NaN NaN
Gemini-Flash-2.0-Thinking-12-19 48.7 48.7 8.84 1 2.3 NaN NaN
Claude-Sonnet-4 47.1 47.1 7.24 1 2.3 NaN NaN
Claude-Opus-4 46.9 46.9 7.19 1 2.3 NaN NaN
Claude-3.5-Sonnet-20241022 36.4 41.6 3.84 10 2.3 2.2 0.68
Gemini-Flash-2.0-Exp 31.3 31.3 2.61 1 2.2 NaN NaN
GPT-4O-2024-08-06 29.5 38.8 2.02 10 2.1 2 0.81
GPT-4-Turbo-2024-04-09 28.7 37.9 1.91 10 2.1 2 0.82
GPT-4O-mini-2024-07-18 27.5 34.4 1.72 10 2.1 2 0.62
DeepSeek-V3 27.2 35.7 2.83 10 2.1 2 0.75
Claude-3-Haiku 20.2 24 1.04 10 1.9 1.8 0.54