Measuring all the noises of LLM evals


Datasets: main, train_curve; open models with vLLM at T=0.7, T=1, and high-K T=0.7.

Details are linked, and additional summary statistics are available on mouse-over.

Raw data: summary.csv

benchmark_id            size  models  SE(A)  SE(A-B)  corr(A,B)  no_solve  tau-   details
CRUXEval-input-T0.2      800      37    1.7      1.8         56       2.4   10     models | examples | data | raw
CRUXEval-input-T0.8      800      31    1.6      1.8         69       2.1   10     models | examples | data | raw
CRUXEval-output-T0.2     800      37    1.7      1.6         64       3      3.8   models | examples | data | raw
CRUXEval-output-T0.8     800      31    1.7      1.6         75       4.4    6.9   models | examples | data | raw
DS1000                  1000     105    1.4      1.5         46      11      1.2   models | examples | data | raw
agi_english             2546      35    0.93     1.1         25       0.98  21     models | examples | data | raw
arc_challenge           1165      38    1.4      1.3         62      14     11     models | examples | data | raw
gsm8k                   1319      37    1.2      1.3         38       2.1    0.53  models | examples | data | raw
hellaswag              10042      36    0.41     0.26        77       6.1    1.6   models | examples | data | raw
humaneval                164      78    3.5      3.8         46       3.7    1.2   models | examples | data | raw
humaneval+               164      49    3.7      3.9         45       4.3    1.8   models | examples | data | raw
lcb_codegen_v5           880      24    1.5      1.3         66       6.4    1.1   models | examples | data | raw
lcb_codegen_v6          1055      33    1.3      1.1         65       3.4    1.2   models | examples | data | raw
lcb_codegen_v6_080124    454      33    2.2      1.9         65       5.3    2     models | examples | data | raw
mbpp                     378      59    2.4      2.4         47       2.4    4     models | examples | data | raw
mbpp+                    378      59    2.5      2.4         54       9.5    5.8   models | examples | data | raw
mmlu                   14042      36    0.39     0.44        38       0.6   12     models | examples | data | raw
nq                      3610      36    0.72     0.64        60      32      5.5   models | examples | data | raw
piqa                    1838      36    0.93     0.7         71       5.3    8.4   models | examples | data | raw
safim                  17410      22    0.36     0.36        50      16      7.9   models | examples | data | raw
siqa                    1954      36    1.1      0.86        71      15     19     models | examples | data | raw
swebench-lite            300      84    2.7      2.6         52      12      1.7   models | examples | data | raw
swebench-test           2294      23    0.87     0.77        55      43      1     models | examples | data | raw
swebench-verified        500     131    2.1      1.9         58       6.8    1.6   models | examples | data | raw
tqa                    11313      36    0.43     0.35        67      12      2.9   models | examples | data | raw

Notes: