Measuring all the noises of LLM evals


Datasets: main, train_curve. Open models with vLLM: T=0.7, T=1, high K T=0.7.

Details are linked, and more summary statistics are available on mouseover.

Raw data: summary.csv
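
If you want to explore the numbers programmatically, the raw data is the CSV above. A minimal loading sketch in Python, assuming the CSV's column names match the table headers below (an assumption; check the file's actual header row):

```python
# Sketch: load summary.csv and rank benchmarks by single-model noise.
# Column names ("SE(A)", "SE(A-B)", ...) are assumed to match the
# rendered table headers; adjust to the file's real header if they differ.
import pandas as pd

df = pd.read_csv("summary.csv")

# Benchmarks with the largest single-model standard error first.
cols = ["benchmark_id", "size", "SE(A)", "SE(A-B)"]
print(df.sort_values("SE(A)", ascending=False)[cols].to_string(index=False))
```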

| benchmark_id | size | models | SE(A) | SE_x(A) | SE(A-B) | SE_x(A-B) | corr(A,B) | no_solve | tau- | details |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---|
| aime2024_cot | 30 | 52 | 7.3 | 5.7 | 8.1 | 4.7 | 61 | 20 | 0 | models \| examples \| data \| raw |
| aime2025_cot | 30 | 52 | 7.2 | 5.9 | 6.7 | 3.4 | 80 | 27 | 3.3 | models \| examples \| data \| raw |
| ap_cot | 711 | 52 | 1.6 | 1.2 | 1.9 | 1.1 | 51 | 0 | 3.1 | models \| examples \| data \| raw |
| bbh_cot | 6511 | 45 | 0.56 | 0.41 | 0.68 | 0.4 | 48 | 1.2 | 2.1 | models \| examples \| data \| raw |
| cruxeval_input_cot | 800 | 52 | 1.6 | 1.2 | 1.9 | 1 | 60 | 0.62 | 1.5 | models \| examples \| data \| raw |
| cruxeval_output_cot | 800 | 52 | 1.6 | 1.2 | 1.8 | 0.93 | 65 | 2.5 | 1.1 | models \| examples \| data \| raw |
| ds1000 | 1000 | 39 | 1.3 | 1 | 1.3 | 0.8 | 65 | 27 | 0.9 | models \| examples \| data \| raw |
| gmat_cot | 92 | 51 | 4.6 | 3.1 | 5.8 | 3.4 | 43 | 0 | 1.1 | models \| examples \| data \| raw |
| gpqa_cot | 448 | 51 | 2.1 | 1.3 | 2.8 | 1.6 | 31 | 0 | 21 | models \| examples \| data \| raw |
| gre_physics_cot | 75 | 51 | 5.2 | 3.7 | 6.6 | 4.2 | 32 | 0 | 5.3 | models \| examples \| data \| raw |
| gsm8k_cot | 1319 | 52 | 1.1 | 0.78 | 1.1 | 0.64 | 61 | 0.15 | 0.99 | models \| examples \| data \| raw |
| gsm8k_plus_cot | 10552 | 52 | 0.44 | 0.34 | 0.44 | 0.25 | 71 | 0.78 | 5.6 | models \| examples \| data \| raw |
| human_eval | 164 | 51 | 3.5 | 2.6 | 4.4 | 2.6 | 49 | 0.61 | 0 | models \| examples \| data \| raw |
| human_eval_plus | 164 | 51 | 3.6 | 2.8 | 4.3 | 2.6 | 53 | 6.1 | 1.2 | models \| examples \| data \| raw |
| jeebench_chat_cot | 515 | 49 | 1.6 | 1.1 | 2 | 1.2 | 40 | 14 | 5.2 | models \| examples \| data \| raw |
| leetcode | 180 | 51 | 3 | 2.4 | 3.4 | 2.2 | 54 | 29 | 0 | models \| examples \| data \| raw |
| lsat_cot | 403 | 45 | 2.3 | 1.8 | 2.7 | 1.8 | 48 | 0.25 | 10 | models \| examples \| data \| raw |
| math500_cot | 500 | 51 | 2 | 1.5 | 2.1 | 1.1 | 69 | 0.6 | 1 | models \| examples \| data \| raw |
| math_cot | 5000 | 49 | 0.63 | 0.47 | 0.73 | 0.37 | 66 | 1.1 | 0.64 | models \| examples \| data \| raw |
| mbpp | 500 | 35 | 2.1 | 1.7 | 2.1 | 1.4 | 69 | 3 | 2.4 | models \| examples \| data \| raw |
| mgsm_cot | 2750 | 52 | 0.83 | 0.66 | 0.9 | 0.53 | 64 | 0.36 | 1.6 | models \| examples \| data \| raw |
| mmlu_pro_cot | 12032 | 45 | 0.42 | 0.31 | 0.49 | 0.29 | 54 | 0.066 | 13 | models \| examples \| data \| raw |
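
To make the SE columns concrete, here is a minimal sketch of how per-benchmark standard errors like SE(A) and SE(A-B) can be estimated from per-question 0/1 scores. The key point it illustrates is that A and B are scored on the same questions, so the paired SE(A-B) can be smaller than the naive sqrt(SE(A)² + SE(B)²) when corr(A,B) is positive. The synthetic scores below are illustrative only; the paper's exact estimators may differ.

```python
# Sketch: question-level standard errors from paired 0/1 scores,
# reported in percentage points as in the table above.
import numpy as np

rng = np.random.default_rng(0)
n = 500                                  # number of questions ("size")
a = rng.random(n) < 0.70                 # model A per-question correctness
b = a ^ (rng.random(n) < 0.15)           # model B: A with ~15% of answers flipped

def se_mean(x):
    """Standard error of the mean of 0/1 scores, in percentage points."""
    x = x.astype(float)
    return 100 * x.std(ddof=1) / np.sqrt(len(x))

d = a.astype(float) - b.astype(float)    # paired per-question differences
print(f"SE(A)     = {se_mean(a):.2f} pp")
print(f"SE(A-B)   = {se_mean(d):.2f} pp")                 # paired, not naive
print(f"corr(A,B) = {100 * np.corrcoef(a, b)[0, 1]:.0f}")
```

Run with a higher flip rate (lower corr(A,B)) and SE(A-B) grows toward the naive unpaired value, which is the pattern visible across the benchmarks above.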

Notes: