Measuring all the noises of LLM evals


Datasets: main, train_curve; open models with vLLM: T=0.7, T=1, high K T=0.7

Details are linked and more summary stats are available on mouse over.

Raw data: summary.csv

benchmark_id     size  models  SE(A)  SE_x(A)  SE(A-B)  SE_x(A-B)  corr(A,B)  no_solve  tau-  details
aime2024_cot     30    50      7.6    5.7      8        3.7        77         3.3       0     models | examples | data | raw
aime2025_cot     30    50      7      5.6      6.8      3.3        82         3.3       13    models | examples | data | raw
human_eval_plus  164   49      3.6    2.7      4.3      2.4        61         6.1       1.2   models | examples | data | raw
math500_cot      500   48      2      1.5      2.1      0.94       79         0.4       1.2   models | examples | data | raw

Notes: