Measuring all the noises of LLM evals


Datasets: main; train_curve; open models with vLLM (T=0.7, T=1, high K T=0.7).


Full table: summary.csv

benchmark_id         size   models  SE(A)  SE_x(A)  SE(A-B)  SE_x(A-B)  no_solve  tau-
aime2024_cot         30     52      7.3    5.7      8.1      4.7        20        0
aime2025_cot         30     52      7.2    5.9      6.7      3.4        27        3.3
ap_cot               711    52      1.6    1.2      1.9      1.1        0         3.1
bbh_cot              6511   45      0.56   0.41     0.68     0.4        1.2       2.1
cruxeval_input_cot   800    52      1.6    1.2      1.9      1          0.62      1.5
cruxeval_output_cot  800    52      1.6    1.2      1.8      0.93       2.5       1.1
ds1000               1000   39      1.3    1        1.3      0.8        27        0.9
gmat_cot             92     51      4.6    3.1      5.8      3.4        0         1.1
gpqa_cot             448    51      2.1    1.3      2.8      1.6        0         21
gre_physics_cot      75     51      5.2    3.7      6.6      4.2        0         5.3
gsm8k_cot            1319   52      1.1    0.78     1.1      0.64       0.15      0.99
gsm8k_plus_cot       10552  52      0.44   0.34     0.44     0.25       0.78      5.6
human_eval           164    51      3.5    2.6      4.4      2.6        0.61      0
human_eval_plus      164    51      3.6    2.8      4.3      2.6        6.1       1.2
jeebench_chat_cot    515    49      1.6    1.1      2        1.2        14        5.2
leetcode             180    51      3      2.4      3.4      2.2        29        0
lsat_cot             403    45      2.3    1.8      2.7      1.8        0.25      10
math500_cot          500    51      2      1.5      2.1      1.1        0.6       1
math_cot             5000   49      0.63   0.47     0.73     0.37       1.1       0.64
mbpp                 500    35      2.1    1.7      2.1      1.4        3         2.4
mgsm_cot             2750   52      0.83   0.66     0.9      0.53       0.36      1.6
mmlu_pro_cot         12032  45      0.42   0.31     0.49     0.29       0.066     13
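
To work with these numbers programmatically, the rows can be parsed and ranked by any column. The sketch below is a minimal illustration, not the project's own tooling: it assumes summary.csv is comma-separated and uses the same column names as the table above (neither is confirmed here), and the helpers load_summary and rank_by are hypothetical names. The two sample rows are copied from the table.

```python
import csv
import io

# Two rows copied from the table above. The real summary.csv is ASSUMED
# (not confirmed) to be comma-separated with these same column names.
SAMPLE = """benchmark_id,size,models,SE(A),SE_x(A),SE(A-B),SE_x(A-B),no_solve,tau-
aime2024_cot,30,52,7.3,5.7,8.1,4.7,20,0
mmlu_pro_cot,12032,45,0.42,0.31,0.49,0.29,0.066,13
"""


def load_summary(handle):
    """Hypothetical helper: read rows, converting every numeric column to float."""
    rows = []
    for row in csv.DictReader(handle):
        parsed = {"benchmark_id": row["benchmark_id"]}
        for key, value in row.items():
            if key != "benchmark_id":
                parsed[key] = float(value)
        rows.append(parsed)
    return rows


def rank_by(rows, column="SE(A-B)"):
    """Hypothetical helper: sort rows by a column, largest value first."""
    return sorted(rows, key=lambda r: r[column], reverse=True)


rows = load_summary(io.StringIO(SAMPLE))
noisiest = rank_by(rows)[0]["benchmark_id"]
print(noisiest)
```

To run this against the full table, replace io.StringIO(SAMPLE) with an open file handle, for example open("summary.csv", newline=""), adjusting the column names if the file differs from the table shown here.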

Generated on 2026-02-14 06:32:13