Measuring all the noises of LLM evals

Datasets: main, train_curve. Open models with vLLM: T=0.7, T=1, high K T=0.7.

Each benchmark row links to its details; more summary statistics are available on mouseover.

Full table: summary.csv

benchmark_id           size   models  SE(A)  SE(A-B)  no_solve  tau-  details
CRUXEval-input-T0.2    800    37      1.7    1.8      2.4       10    models | examples | data | more
CRUXEval-input-T0.8    800    31      1.6    1.8      2.1       10    models | examples | data | more
CRUXEval-output-T0.2   800    37      1.7    1.6      3         3.8   models | examples | data | more
CRUXEval-output-T0.8   800    31      1.7    1.6      4.4       6.9   models | examples | data | more
DS1000                 1000   105     1.4    1.5      11        1.2   models | examples | data | more
agi_english            2546   35      0.93   1.1      0.98      21    models | examples | data | more
arc_challenge          1165   38      1.4    1.3      14        11    models | examples | data | more
gsm8k                  1319   37      1.2    1.3      2.1       0.53  models | examples | data | more
hellaswag              10042  36      0.41   0.26     6.1       1.6   models | examples | data | more
humaneval              164    78      3.5    3.8      3.7       1.2   models | examples | data | more
humaneval+             164    49      3.7    3.9      4.3       1.8   models | examples | data | more
lcb_codegen_v5         880    24      1.5    1.3      6.4       1.1   models | examples | data | more
lcb_codegen_v6         1055   33      1.3    1.1      3.4       1.2   models | examples | data | more
lcb_codegen_v6_080124  454    33      2.2    1.9      5.3       2     models | examples | data | more
mbpp                   378    59      2.4    2.4      2.4       4     models | examples | data | more
mbpp+                  378    59      2.5    2.4      9.5       5.8   models | examples | data | more
mmlu                   14042  36      0.39   0.44     0.6       12    models | examples | data | more
nq                     3610   36      0.72   0.64     32        5.5   models | examples | data | more
piqa                   1838   36      0.93   0.7      5.3       8.4   models | examples | data | more
safim                  17410  22      0.36   0.36     16        7.9   models | examples | data | more
siqa                   1954   36      1.1    0.86     15        19    models | examples | data | more
swebench-bash-only     500    28      2.1    2        13        2     models | examples | data | more
swebench-lite          300    84      2.7    2.6      12        1.7   models | examples | data | more
swebench-multimodal    510    12      2.1    1.4      51        5.3   models | examples | data | more
swebench-test          2294   23      0.87   0.77     43        1     models | examples | data | more
swebench-verified      500    131     2.1    1.9      6.8       1.6   models | examples | data | more
terminal-bench-1.0     80     25      5.4    5.3      18        5     models | examples | data | more
terminal-bench-2.0     89     7       5.1    5.8      9         12    models | examples | data | more
tqa                    11313  36      0.43   0.35     12        2.9   models | examples | data | more
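
This page does not define the SE columns. As a rough sanity check on their scale, the sketch below computes the textbook binomial standard error of a mean of independent 0/1 scores. It assumes, without confirmation from this page, that SE(A) is a standard error of one model's accuracy in percentage points. The function name binomial_se_pct and the default accuracy of 0.5 (the worst case) are illustrative choices, not taken from the page or its code.

```python
import math


def binomial_se_pct(n_examples: int, accuracy: float = 0.5) -> float:
    """Standard error, in percentage points, of the mean of n independent 0/1 scores.

    accuracy=0.5 gives the largest possible value for a given n.
    """
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    return 100.0 * math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


# Benchmark sizes copied from the table above.
sizes = {"humaneval": 164, "gsm8k": 1319, "mmlu": 14042}

for name, n in sizes.items():
    print(f"{name}: n={n}, worst-case binomial SE = {binomial_se_pct(n):.2f} points")
```

Under that assumption the bound shrinks as 1/sqrt(size), which matches the general pattern in the table: small benchmarks such as humaneval carry SE values of several points, while large ones such as mmlu stay below half a point. The reported values need not equal this bound, since the page may use a different estimator.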

Generated on 2026-02-14 06:32:53