Measuring all the noises of LLM evals
Details are linked and more summary stats are available on mouse over.
- size: number of questions in the eval
- models: the number of models used on this eval
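- SE(A): the standard error (SE_x for data SE) in percent. A difference of 1.96 × 1.41 × SE(A) between two models is the conventional threshold for significance at the 95% level (1.41 ≈ √2 accounts for an unpaired comparison of two scores with equal SE).
- SE(A-B): the paired standard error (SE_x for data SE) in percent. A difference of 1.96 × SE(A-B) between two models evaluated on the same questions is the conventional threshold for significance at the 95% level.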
- no_solve: percent of questions not solved by any model
- tau-: percent of questions negatively correlated with overall model quality, as measured by Kendall's tau
- details: views aggregated by model and by example, the prediction heatmap, and raw data tables
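
The columns above can be sketched for a single pair of models as follows. This is a minimal illustration, not the code behind the table: it assumes one 0/1 score per question and uses the plain sample standard error. The function names and the toy data are hypothetical, and the table's own aggregation across models may differ.

```python
import math
from statistics import stdev


def se_percent(scores):
    """Standard error of the mean score over questions, in percent."""
    return 100 * stdev(scores) / math.sqrt(len(scores))


def paired_se_percent(a, b):
    """Standard error of the mean per-question difference A - B, in percent."""
    diffs = [x - y for x, y in zip(a, b)]
    return 100 * stdev(diffs) / math.sqrt(len(diffs))


def no_solve_percent(results):
    """Percent of questions that no model solved.

    `results` is a list of per-model score lists, all over the same questions.
    """
    n = len(results[0])
    unsolved = sum(1 for i in range(n) if not any(r[i] for r in results))
    return 100 * unsolved / n


# Toy data (made up): 1 = solved, 0 = not solved, same 10 questions for both models.
model_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
model_b = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]

# Unpaired threshold: 1.96 * sqrt(2) * SE(A), treating the two scores as independent.
unpaired_threshold = 1.96 * math.sqrt(2) * se_percent(model_a)
# Paired threshold: 1.96 * SE(A-B), using per-question differences.
paired_threshold = 1.96 * paired_se_percent(model_a, model_b)

print(f"SE(A) = {se_percent(model_a):.2f}%, unpaired threshold = {unpaired_threshold:.2f}%")
print(f"SE(A-B) = {paired_se_percent(model_a, model_b):.2f}%, paired threshold = {paired_threshold:.2f}%")
print(f"no_solve = {no_solve_percent([model_a, model_b]):.1f}%")
```

When two models agree on most questions, the per-question differences vary less than the individual scores, so the paired threshold is typically the smaller of the two.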
Full table: summary.csv
Notes:
- Some datasets contain clustered questions (DS-1000, SAFIM, SQuAD, SWE-bench). A cluster correction, which would further widen the interval, might be needed but is not applied.
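
For reference, one common form of cluster correction is the cluster-robust standard error of the mean, sketched below. This is an assumption about what such a correction could look like, not something applied on this page. The function name, the G/(G-1) small-sample factor, and the toy data are illustrative choices.

```python
import math
from collections import defaultdict
from statistics import stdev


def clustered_se(scores, cluster_ids):
    """Cluster-robust standard error of the mean score.

    Residuals are summed within each cluster before squaring, so positively
    correlated questions in the same cluster widen the estimate.
    """
    n = len(scores)
    mean = sum(scores) / n
    resid_sums = defaultdict(float)
    for s, c in zip(scores, cluster_ids):
        resid_sums[c] += s - mean
    g = len(resid_sums)  # number of clusters
    var = sum(r * r for r in resid_sums.values()) / n**2
    var *= g / (g - 1)  # small-sample correction (an illustrative choice)
    return math.sqrt(var)


# Toy data (made up): 8 questions in 4 clusters whose outcomes move together.
scores = [1, 1, 0, 0, 1, 1, 0, 0]
clusters = [0, 0, 1, 1, 2, 2, 3, 3]

naive = stdev(scores) / math.sqrt(len(scores))
clustered = clustered_se(scores, clusters)
print(f"naive SE = {naive:.3f}, clustered SE = {clustered:.3f}")
```

With every question in its own cluster, this estimator reduces to the ordinary sample standard error, which makes it a drop-in replacement when cluster labels are available.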
Generated on 2026-02-14 06:27:28