Measuring all the noises of LLM evals

Datasets: main; train_curve; open models with vLLM (T=0.7, T=1, high K T=0.7)

Full table: summary.csv

benchmark_id         size   models  SE(A)  SE_x(A)  SE(A-B)  SE_x(A-B)  no_solve  tau-
aime2024_cot         30     51      7.4    5.5      7.6      3          37        0
aime2025_cot         30     52      7      6        6.7      4.2        50        0
ap_cot               711    52      1.6    1.1      1.9      1.1        0         3.2
bbh_cot              6511   45      0.55   0.37     0.68     0.36       2         1.6
cruxeval_input_cot   800    52      1.6    1.1      1.9      0.85       2.4       1.4
cruxeval_output_cot  800    52      1.6    1.1      1.8      0.83       4.4       0.5
ds1000               1000   39      1.2    0.93     1.3      0.66       32        0.3
gmat_cot             92     51      4.6    2.9      5.9      3.2        0         1.1
gpqa_cot             448    51      2.1    1.2      2.7      1.4        0         21
gre_physics_cot      75     51      5.2    3.4      6.7      3.9        0         4
gsm8k_cot            1319   52      1.1    0.76     1.2      0.61       0.076     0.91
gsm8k_plus_cot       10552  52      0.44   0.33     0.48     0.24       1.5       6.2
human_eval           164    51      3.5    2.4      4.4      2.2        1.2       0
human_eval_plus      164    51      3.6    2.5      4.3      2.3        6.7       0.61
jeebench_chat_cot    515    49      1.7    1.1      2.1      0.99       19        6
leetcode             180    51      3.1    2.3      3.5      1.8        39        0
lsat_cot             403    42      2.3    1.7      2.8      1.7        0.25      11
math500_cot          500    51      2      1.5      2.2      1          1.8       0.8
math_cot             5000   48      0.63   0.44     0.73     0.34       1.9       0.72
mbpp                 500    35      2.1    1.7      2.2      1.2        3.8       2.2
mgsm_cot             2750   51      0.83   0.65     0.93     0.57       0.8       1.3
mmlu_pro_cot         12032  45      0.42   0.3      0.49     0.26       0.43      13
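
As a rough sanity check on how the SE(A) column scales with benchmark size, the sketch below computes the textbook binomial standard error of an accuracy estimate. It rests on two assumptions that this page does not state: that SE(A) is the standard error of a single model's accuracy, and that the table reports it in percentage points. The function name `binomial_se_pct` is hypothetical and this is not the page's own method, only the standard formula sqrt(p(1-p)/n), which is largest at p = 0.5.

```python
import math


def binomial_se_pct(n: int, p: float = 0.5) -> float:
    """Binomial standard error of an accuracy estimate, in percentage points.

    n is the number of benchmark examples and p the true accuracy.
    p = 0.5 gives the largest possible value for a given n.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# (benchmark_id, size, SE(A)) copied from three rows of the table above.
rows = [
    ("aime2024_cot", 30, 7.4),
    ("human_eval", 164, 3.5),
    ("mmlu_pro_cot", 12032, 0.42),
]

for name, size, reported in rows:
    bound = binomial_se_pct(size)
    print(f"{name}: reported SE(A) = {reported}, binomial value at p=0.5 = {bound:.2f}")
```

Under those assumptions, the reported values for these rows sit below the p = 0.5 figure, and both shrink roughly as one over the square root of the benchmark size.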

Generated on 2026-02-14 06:31:34