Measuring all the noises of LLM evals


Datasets: main, train_curve. Open models with vLLM: T=0.7, T=1, high K T=0.7.

Per-benchmark details are linked from each row; in the interactive version, more summary stats are shown on mouse-over.

Raw data: summary.csv

| benchmark_id | size | models | SE(A) | SE_x(A) | SE(A-B) | SE_x(A-B) | corr(A,B) | no_solve | tau- | details |
|---|---|---|---|---|---|---|---|---|---|---|
| aime2024_cot | 30 | 51 | 7.4 | 5.5 | 7.6 | 3 | 65 | 37 | 0 | models / examples / data / raw |
| aime2025_cot | 30 | 52 | 7 | 6 | 6.7 | 4.2 | 63 | 50 | 0 | models / examples / data / raw |
| ap_cot | 711 | 52 | 1.6 | 1.1 | 1.9 | 1.1 | 43 | 0 | 3.2 | models / examples / data / raw |
| bbh_cot | 6511 | 45 | 0.55 | 0.37 | 0.68 | 0.36 | 39 | 2 | 1.6 | models / examples / data / raw |
| cruxeval_input_cot | 800 | 52 | 1.6 | 1.1 | 1.9 | 0.85 | 49 | 2.4 | 1.4 | models / examples / data / raw |
| cruxeval_output_cot | 800 | 52 | 1.6 | 1.1 | 1.8 | 0.83 | 54 | 4.4 | 0.5 | models / examples / data / raw |
| ds1000 | 1000 | 39 | 1.2 | 0.93 | 1.3 | 0.66 | 58 | 32 | 0.3 | models / examples / data / raw |
| gmat_cot | 92 | 51 | 4.6 | 2.9 | 5.9 | 3.2 | 33 | 0 | 1.1 | models / examples / data / raw |
| gpqa_cot | 448 | 51 | 2.1 | 1.2 | 2.7 | 1.4 | 23 | 0 | 21 | models / examples / data / raw |
| gre_physics_cot | 75 | 51 | 5.2 | 3.4 | 6.7 | 3.9 | 26 | 0 | 4 | models / examples / data / raw |
| gsm8k_cot | 1319 | 52 | 1.1 | 0.76 | 1.2 | 0.61 | 49 | 0.076 | 0.91 | models / examples / data / raw |
| gsm8k_plus_cot | 10552 | 52 | 0.44 | 0.33 | 0.48 | 0.24 | 58 | 1.5 | 6.2 | models / examples / data / raw |
| human_eval | 164 | 51 | 3.5 | 2.4 | 4.4 | 2.2 | 41 | 1.2 | 0 | models / examples / data / raw |
| human_eval_plus | 164 | 51 | 3.6 | 2.5 | 4.3 | 2.3 | 44 | 6.7 | 0.61 | models / examples / data / raw |
| jeebench_chat_cot | 515 | 49 | 1.7 | 1.1 | 2.1 | 0.99 | 36 | 19 | 6 | models / examples / data / raw |
| leetcode | 180 | 51 | 3.1 | 2.3 | 3.5 | 1.8 | 47 | 39 | 0 | models / examples / data / raw |
| lsat_cot | 403 | 42 | 2.3 | 1.7 | 2.8 | 1.7 | 41 | 0.25 | 11 | models / examples / data / raw |
| math500_cot | 500 | 51 | 2 | 1.5 | 2.2 | 1 | 58 | 1.8 | 0.8 | models / examples / data / raw |
| math_cot | 5000 | 48 | 0.63 | 0.44 | 0.73 | 0.34 | 50 | 1.9 | 0.72 | models / examples / data / raw |
| mbpp | 500 | 35 | 2.1 | 1.7 | 2.2 | 1.2 | 64 | 3.8 | 2.2 | models / examples / data / raw |
| mgsm_cot | 2750 | 51 | 0.83 | 0.65 | 0.93 | 0.57 | 54 | 0.8 | 1.3 | models / examples / data / raw |
| mmlu_pro_cot | 12032 | 45 | 0.42 | 0.3 | 0.49 | 0.26 | 44 | 0.43 | 13 | models / examples / data / raw |
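For intuition about the SE columns: given per-question 0/1 correctness for two models, a single-score standard error and a paired-difference standard error can be estimated as the standard error of the mean of the scores and of the per-question differences, respectively. A minimal sketch under a simple i.i.d. question-sampling assumption (the hypothetical arrays below are illustrative, not real eval data, and this is not necessarily the exact estimator behind the table):

```python
import math

def se_mean(xs):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / (n - 1)  # unbiased sample variance
    return math.sqrt(var / n)

# Hypothetical per-question correctness for two models A and B.
a = [1, 0, 1, 1, 0, 1, 0, 1]
b = [1, 0, 0, 1, 0, 1, 1, 1]

se_a = se_mean(a)                                 # analogue of SE(A)
se_diff = se_mean([x - y for x, y in zip(a, b)])  # analogue of SE(A-B), paired
print(round(se_a, 3), round(se_diff, 3))
```

Because the difference is taken question by question, shared difficulty cancels when the two models' correctness is positively correlated, which is consistent with the table: the question-sampling component SE_x(A-B) is generally smaller than SE_x(A), and more so on benchmarks with higher corr(A,B).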

Notes: