Measuring all the noises of LLM evals

Datasets: main; train_curve; open models with vLLM (T=0.7, T=1, high K T=0.7)

Details are linked in each row, and more summary statistics appear on mouse-over.

Full table: summary.csv

benchmark_id       size  models  SE(A)  SE_x(A)  SE(A-B)  SE_x(A-B)  no_solve  tau-  details
swebench-pro        731     150    1.6      1.3      1.5       0.25        39    18  models | examples | data | more
swebench-verified   500     150    2.2      1.8      1.9       0.22        20    24  models | examples | data | more
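
For a sense of scale, the benchmark sizes above can be plugged into the textbook binomial standard error, sqrt(p(1-p)/n). This page does not define its columns, so the sketch below assumes that SE(A) is a standard error of accuracy expressed in percentage points. The function name `binomial_se_pct` is hypothetical and is not taken from the project's code. The worst case p = 0.5 is used because per-model accuracies are not listed here.

```python
import math


def binomial_se_pct(p: float, n: int) -> float:
    """Standard error, in percentage points, of the mean of n independent
    0/1 outcomes that each succeed with probability p.

    Assumption: this is the textbook binomial formula, not necessarily
    the estimator behind the SE(A) column on this page.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# Benchmark sizes copied from the table; p = 0.5 maximizes p * (1 - p).
sizes = {"swebench-pro": 731, "swebench-verified": 500}

for name, n in sizes.items():
    print(f"{name}: worst-case binomial SE = {binomial_se_pct(0.5, n):.2f} points")
```

Under that assumption, the worst-case values are about 1.85 points for 731 examples and about 2.24 points for 500 examples. The tabulated SE(A) values of 1.6 and 2.2 fall at or below these.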

Generated on 2026-02-14 06:29:09