lcb_codegen_v6_080124: by models


SE predicted by accuracy

Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.
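As a rough illustration of why a standard error depends on accuracy, the sketch below computes the binomial standard error of a single model's accuracy. This is the simplest single-model case, not necessarily the pairwise estimator used for the plot. The function name `binomial_se` and the question count `N_QUESTIONS` are assumptions made for illustration; the page does not state the benchmark size.

```python
import math


def binomial_se(accuracy_pct: float, n_questions: int) -> float:
    """Standard error, in percentage points, of an accuracy measured on
    n_questions independent pass/fail questions (binomial model)."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


# Hypothetical benchmark size, chosen only for illustration.
N_QUESTIONS = 400

for acc in (20, 35, 50, 65, 80):
    print(f"accuracy={acc:>3}%  SE={binomial_se(acc, N_QUESTIONS):.2f}")
```

Under this model the standard error peaks at 50% accuracy and shrinks symmetrically toward 0% and 100%, which matches the general shape of the SE(A) column in the results table.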

CDF of question level accuracy

Results table by model

| model | pass1 | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| O4-Mini (High) | 80.2 | 27.7 | 1 | 1.9 | NaN | NaN |
| O3 (High) | 75.8 | 24.5 | 1 | 2 | NaN | NaN |
| O4-Mini (Medium) | 74.2 | 22.4 | 1 | 2.1 | NaN | NaN |
| Gemini-2.5-Pro-06-05 | 73.6 | 21.8 | 1 | 2.1 | NaN | NaN |
| DeepSeek-R1-0528 | 73.1 | 21.7 | 10 | 2.1 | 1.8 | 1.1 |
| Gemini-2.5-Pro-05-06 | 71.8 | 20.3 | 1 | 2.1 | NaN | NaN |
| EXAONE-4.0-32B | 70 | 19.2 | 4 | 2.1 | 1.8 | 1.2 |
| OpenReasoning-Nemotron-32B | 69.8 | 19.2 | 10 | 2.2 | 1.8 | 1.2 |
| Gemini-2.5-Pro-03-25 | 67.8 | 18.2 | 1 | 2.2 | NaN | NaN |
| O3-Mini-2025-01-31 (High) | 67.4 | 18.3 | 1 | 2.2 | NaN | NaN |
| Grok-3-Mini (High) | 66.7 | 21.3 | 1 | 2.2 | NaN | NaN |
| Qwen3-235B-A22B | 65.9 | 16.5 | 1 | 2.2 | NaN | NaN |
| O4-Mini (Low) | 65.9 | 16.4 | 1 | 2.2 | NaN | NaN |
| XBai-o4-medium | 65 | 15.9 | 1 | 2.2 | NaN | NaN |
| O3-Mini-2025-01-31 (Med) | 63 | 15.2 | 1 | 2.3 | NaN | NaN |
| Gemini-2.5-Flash-05-20 | 61.9 | 14.1 | 1 | 2.3 | NaN | NaN |
| Gemini-2.5-Flash-04-17 | 60.6 | 13.4 | 1 | 2.3 | NaN | NaN |
| O3-Mini-2025-01-31 (Low) | 57 | 11.5 | 1 | 2.3 | NaN | NaN |
| Claude-Opus-4 (Thinking) | 56.6 | 10.8 | 1 | 2.3 | NaN | NaN |
| Claude-Sonnet-4 (Thinking) | 55.9 | 10.8 | 1 | 2.3 | NaN | NaN |
| QwQ-32B_temp | 55.7 | 10.8 | 1 | 2.3 | NaN | NaN |
| Claude-3.7-Sonnet | 50.4 | 8.48 | 1 | 2.3 | NaN | NaN |
| Gemini-Flash-2.0-Thinking-01-21 | 48.9 | 8.54 | 1 | 2.3 | NaN | NaN |
| Gemini-Flash-2.0-Thinking-12-19 | 48.7 | 8.84 | 1 | 2.3 | NaN | NaN |
| Claude-Sonnet-4 | 47.1 | 7.24 | 1 | 2.3 | NaN | NaN |
| Claude-Opus-4 | 46.9 | 7.19 | 1 | 2.3 | NaN | NaN |
| Claude-3.5-Sonnet-20241022 | 36.4 | 3.84 | 10 | 2.3 | 2.2 | 0.68 |
| Gemini-Flash-2.0-Exp | 31.3 | 2.61 | 1 | 2.2 | NaN | NaN |
| GPT-4O-2024-08-06 | 29.5 | 2.02 | 10 | 2.1 | 2 | 0.81 |
| GPT-4-Turbo-2024-04-09 | 28.7 | 1.91 | 10 | 2.1 | 2 | 0.82 |
| GPT-4O-mini-2024-07-18 | 27.5 | 1.72 | 10 | 2.1 | 2 | 0.62 |
| DeepSeek-V3 | 27.2 | 2.83 | 10 | 2.1 | 2 | 0.75 |
| Claude-3-Haiku | 20.2 | 1.04 | 10 | 1.9 | 1.8 | 0.54 |
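
For the rows with more than one sample (count > 1), the three SE columns appear to be related by a sum of squares, SE(A)² ≈ SE_x(A)² + SE_pred(A)², as one would expect if the total variance were split into a question-sampling part and a prediction-noise part. The page does not state this relation, so treat it as an observation about the published numbers rather than a definition. A minimal sketch that checks it against the table values (the name `combined_se` is hypothetical):

```python
import math

# (SE(A), SE_x(A), SE_pred(A)) copied from the table rows with count > 1.
ROWS = {
    "DeepSeek-R1-0528": (2.1, 1.8, 1.1),
    "EXAONE-4.0-32B": (2.1, 1.8, 1.2),
    "OpenReasoning-Nemotron-32B": (2.2, 1.8, 1.2),
    "Claude-3.5-Sonnet-20241022": (2.3, 2.2, 0.68),
    "GPT-4O-2024-08-06": (2.1, 2.0, 0.81),
    "GPT-4-Turbo-2024-04-09": (2.1, 2.0, 0.82),
    "GPT-4O-mini-2024-07-18": (2.1, 2.0, 0.62),
    "DeepSeek-V3": (2.1, 2.0, 0.75),
    "Claude-3-Haiku": (1.9, 1.8, 0.54),
}


def combined_se(se_x: float, se_pred: float) -> float:
    """Combine two independent error components in quadrature.
    Assumed relation; not stated on the page."""
    return math.hypot(se_x, se_pred)


for model, (se_a, se_x, se_pred) in ROWS.items():
    estimate = combined_se(se_x, se_pred)
    print(f"{model:<28} reported={se_a:.2f}  quadrature={estimate:.2f}  "
          f"diff={estimate - se_a:+.2f}")
```

The reported values are rounded to two significant figures, so agreement to within about 0.1 is the most this check can show. Rows with count = 1 list NaN for SE_x(A) and SE_pred(A), presumably because a single sample per question gives no way to separate the two components.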