Measuring all the noises of LLM evals

Datasets: main; train_curve; open models with vLLM (T=0.7, T=1, high K T=0.7)

Details are linked, and more summary statistics are available on mouseover.

Raw data: summary.csv

benchmark_id       size  models  SE(A)  SE_x(A)  SE(A-B)  SE_x(A-B)  corr(A,B)  no_solve  tau-  details
swebench-pro       731   150     1.6    1.3      1.5      0.25       71         39        18    models | examples | data | raw
swebench-verified  500   150     2.2    1.8      1.9      0.22       76         20        24    models | examples | data | raw

Notes: