Measuring all the noises of LLM evals
Details are linked from each row, and additional summary statistics appear on mouse-over.
- size: number of questions in the eval
- models: the number of models used on this eval
- SE(A): the standard error of a single model's score, in percent (the SE_x(A) column reports the data-only SE). A difference larger than 1.96 × 1.41 × SE(A) is the conventional threshold for statistical significance between two models (see the sketch after this list)
- SE(A-B): the paired standard error of the score difference between two models, in percent (the SE_x(A-B) column reports the data-only SE). A difference larger than 1.96 × SE(A-B) is the conventional threshold for statistical significance
- no_solve: percent of questions not solved by any model
- tau-: percent of questions whose correctness is negatively correlated with overall model quality, as measured by Kendall's tau
- details: links to per-model aggregates, per-example aggregates, the prediction heatmap, and raw data tables
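To make the column definitions concrete, here is a minimal sketch of how they could be computed from a models × questions matrix of 0/1 scores. The function `summary_stats`, the matrix `acc`, and the choice to illustrate SE(A-B) on the first two models are assumptions for illustration only; the actual pipeline behind this table may differ.

```python
import numpy as np
from scipy.stats import kendalltau

def summary_stats(acc: np.ndarray):
    """acc: (n_models, n_questions) matrix of 0/1 per-question scores."""
    n_models, n_questions = acc.shape

    # SE(A): standard error of one model's mean score over questions, in
    # percent; averaged over models here to give a single per-benchmark number.
    se_a = (100 * acc.std(axis=1, ddof=1) / np.sqrt(n_questions)).mean()

    # SE(A-B): paired standard error of the score difference between two
    # models, from per-question differences (illustrated on models 0 and 1).
    diff = acc[0] - acc[1]
    se_ab = 100 * diff.std(ddof=1) / np.sqrt(n_questions)

    # no_solve: percent of questions no model answers correctly.
    no_solve = 100 * (acc.sum(axis=0) == 0).mean()

    # tau-: percent of questions whose per-model correctness is negatively
    # correlated (Kendall's tau) with overall model quality (mean score per
    # model). Questions every model gets right or wrong yield NaN and are
    # not counted as negative.
    quality = acc.mean(axis=1)
    taus = np.array([kendalltau(acc[:, j], quality)[0]
                     for j in range(n_questions)])
    tau_minus = 100 * np.mean(taus < 0)

    return se_a, se_ab, no_solve, tau_minus

# Decision rules implied by the column notes above:
#   unpaired comparison: |mean_A - mean_B| > 1.96 * 1.41 * SE(A)
#   paired comparison:   |mean_A - mean_B| > 1.96 * SE(A-B)
```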
Full table: summary.csv
| benchmark_id | size | models | SE(A) | SE_x(A) | SE(A-B) | SE_x(A-B) | no_solve | tau- | details |
|---|---|---|---|---|---|---|---|---|---|
| swebench-pro | 731 | 150 | 1.6 | 1.3 | 1.5 | 0.25 | 39 | 18 | models \| examples \| data \| more |
| swebench-verified | 500 | 150 | 2.2 | 1.8 | 1.9 | 0.22 | 20 | 24 | models \| examples \| data \| more |
Notes:
- Some datasets contain clustered questions (DS-1000, SAFIM, SQuAD, SWE-bench); a cluster correction might be needed (it further widens the interval) but is not applied here. A sketch of such a correction follows below.
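For reference, here is a minimal sketch of a cluster-robust standard error for a single model's mean score, assuming each question carries a cluster label (for example, its source repository in SWE-bench). The function `clustered_se` and its arguments are hypothetical; as noted above, this correction is not used in the table.

```python
import numpy as np

def clustered_se(scores: np.ndarray, clusters: np.ndarray) -> float:
    """Cluster-robust standard error (in percent) of the mean of per-question
    0/1 scores, where clusters[i] labels the cluster of question i."""
    n = len(scores)
    resid = scores - scores.mean()
    # CR0 sandwich estimator for the mean: sum residuals within each cluster,
    # square the cluster sums, and divide by n^2. Correlated questions inside
    # a cluster inflate this relative to the naive i.i.d. standard error.
    cluster_sums = np.array([resid[clusters == c].sum()
                             for c in np.unique(clusters)])
    return 100 * np.sqrt((cluster_sums ** 2).sum()) / n
```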
Generated on 2026-02-14 06:29:09