lcb_codegen_v6: by models


SE predicted by accuracy

Typical standard errors for comparisons between pairs of models on this dataset, as a function of absolute accuracy.
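As a rough illustration of why a standard error depends on absolute accuracy, the sketch below uses the textbook binomial formula. It assumes each question is an independent pass/fail trial, which may not be how this page computes its values. The names `binomial_se` and `unpaired_diff_se` and the question count used in the example are hypothetical.

```python
import math


def binomial_se(accuracy: float, n_questions: int) -> float:
    """Standard error of a mean of n independent 0/1 outcomes.

    Assumption: questions are independent Bernoulli trials with the same
    success probability. This is a simplification for illustration only.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def unpaired_diff_se(acc_a: float, acc_b: float, n_questions: int) -> float:
    """Standard error of acc_a - acc_b, treating the two models as independent.

    Assumption: no pairing on shared questions. A paired analysis on the
    same questions would generally give a different (often smaller) value.
    """
    return math.hypot(
        binomial_se(acc_a, n_questions),
        binomial_se(acc_b, n_questions),
    )


if __name__ == "__main__":
    n = 1000  # hypothetical number of questions, not taken from this page
    for acc in (0.25, 0.50, 0.85):
        print(f"accuracy={acc:.2f}  SE={100 * binomial_se(acc, n):.2f} points")
```

Under this assumption the standard error peaks at 50% accuracy and shrinks toward either extreme, which matches the general trend of the SE(A) column in the results table.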

CDF of question-level accuracy
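A minimal sketch of how a CDF of this kind could be computed, assuming that question-level accuracy means the fraction of models that solve each question. The page does not state its exact definition, so that reading is an assumption. The function names and the toy `outcomes` matrix are hypothetical.

```python
def question_accuracy(outcomes):
    """Fraction of models solving each question.

    `outcomes` holds one row per model and one 0/1 entry per question
    (an assumed layout, chosen for illustration).
    """
    n_models = len(outcomes)
    n_questions = len(outcomes[0])
    return [
        sum(row[q] for row in outcomes) / n_models
        for q in range(n_questions)
    ]


def empirical_cdf(values):
    """Return sorted (value, cumulative fraction) pairs, one per distinct value."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i, v in enumerate(ordered, start=1):
        if points and points[-1][0] == v:
            # Same value as the previous point: raise its cumulative fraction.
            points[-1] = (v, i / n)
        else:
            points.append((v, i / n))
    return points


if __name__ == "__main__":
    # Toy data: 2 models, 4 questions (made up, not from this page).
    outcomes = [
        [1, 0, 1, 1],
        [1, 0, 0, 1],
    ]
    for value, fraction in empirical_cdf(question_accuracy(outcomes)):
        print(f"accuracy <= {value:.2f}: {fraction:.2f} of questions")
```

Each pair gives the share of questions whose accuracy is at or below the stated value, so a steep rise near 0 or 1 would indicate many questions that nearly every model fails or solves.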

Results table by model

| model | pass1 | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| O4-Mini (High) | 87.3 | 22.7 | 1 | 1 | NaN | NaN |
| O3 (High) | 84.7 | 20.7 | 1 | 1.1 | NaN | NaN |
| O4-Mini (Medium) | 84.5 | 20 | 1 | 1.1 | NaN | NaN |
| DeepSeek-R1-0528 | 84.4 | 19.9 | 10 | 1.1 | 0.95 | 0.58 |
| Gemini-2.5-Pro-06-05 | 84.3 | 19.6 | 1 | 1.1 | NaN | NaN |
| Gemini-2.5-Pro-05-06 | 82.7 | 18.5 | 1 | 1.2 | NaN | NaN |
| Gemini-2.5-Pro-03-25 | 81.5 | 17.9 | 1 | 1.2 | NaN | NaN |
| OpenReasoning-Nemotron-32B | 81 | 17.8 | 10 | 1.2 | 1 | 0.65 |
| EXAONE-4.0-32B | 80.9 | 17.5 | 4 | 1.2 | 1 | 0.64 |
| Qwen3-235B-A22B | 80.4 | 17 | 1 | 1.2 | NaN | NaN |
| XBai-o4-medium | 80.1 | 16.7 | 1 | 1.2 | NaN | NaN |
| Grok-3-Mini (High) | 78.1 | 17.7 | 1 | 1.3 | NaN | NaN |
| O3-Mini-2025-01-31 (High) | 77.7 | 16.1 | 1 | 1.3 | NaN | NaN |
| O4-Mini (Low) | 77.4 | 15.7 | 1 | 1.3 | NaN | NaN |
| Gemini-2.5-Flash-05-20 | 76.2 | 14.7 | 1 | 1.3 | NaN | NaN |
| O3-Mini-2025-01-31 (Med) | 75.4 | 14.7 | 1 | 1.3 | NaN | NaN |
| Gemini-2.5-Flash-04-17 | 75.1 | 14.4 | 1 | 1.3 | NaN | NaN |
| QwQ-32B_temp | 73.5 | 13.2 | 1 | 1.4 | NaN | NaN |
| O3-Mini-2025-01-31 (Low) | 70.6 | 12.3 | 1 | 1.4 | NaN | NaN |
| Claude-Opus-4 (Thinking) | 70.4 | 11.6 | 1 | 1.4 | NaN | NaN |
| Claude-Sonnet-4 (Thinking) | 68.5 | 10.8 | 1 | 1.4 | NaN | NaN |
| Claude-3.7-Sonnet | 63.5 | 8.97 | 1 | 1.5 | NaN | NaN |
| Claude-Opus-4 | 62.4 | 8.62 | 1 | 1.5 | NaN | NaN |
| Claude-Sonnet-4 | 59.4 | 7.75 | 1 | 1.5 | NaN | NaN |
| Gemini-Flash-2.0-Thinking-12-19 | 56.5 | 7.43 | 1 | 1.5 | NaN | NaN |
| Gemini-Flash-2.0-Thinking-01-21 | 55.7 | 6.9 | 1 | 1.5 | NaN | NaN |
| DeepSeek-V3 | 49.6 | 5.51 | 10 | 1.5 | 1.4 | 0.54 |
| Claude-3.5-Sonnet-20241022 | 48.7 | 4.73 | 10 | 1.5 | 1.5 | 0.48 |
| Gemini-Flash-2.0-Exp | 41.8 | 3.04 | 1 | 1.5 | NaN | NaN |
| GPT-4O-2024-08-06 | 38.3 | 2.49 | 10 | 1.5 | 1.4 | 0.59 |
| GPT-4-Turbo-2024-04-09 | 37.3 | 2.28 | 10 | 1.5 | 1.4 | 0.58 |
| GPT-4O-mini-2024-07-18 | 35.5 | 1.96 | 10 | 1.5 | 1.4 | 0.48 |
| Claude-3-Haiku | 22.5 | 0.692 | 10 | 1.3 | 1.2 | 0.38 |