Measuring all the noises of LLM evals

Details are linked and more summary stats are available on mouse over.

size: number of questions in the eval
models: the number of models used on this eval
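SE(A): the standard error of a single model's score, in percent (SE_x(A) is the data SE). A difference of 1.96 × 1.41 × SE(A) between two models is the conventional threshold for significance at the 95% level, where 1.41 ≈ √2 accounts for comparing two independent scores.
SE(A-B): the paired standard error of the difference between two models on the same questions, in percent (SE_x(A-B) is the data SE). A difference of 1.96 × SE(A-B) is the conventional threshold for significance at the 95% level.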
no_solve: percent of questions not solved by any model
tau-: percent of questions whose results are negatively correlated with overall model quality, as measured by Kendall's tau
details: aggregating by the models; by examples; the prediction heatmap; and raw data tables.
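
As a rough illustration of how the SE(A) and SE(A-B) columns translate into a minimum detectable difference, the sketch below applies the two multipliers from the legend. The function names are hypothetical, and it assumes both models have the same SE(A) and that unpaired scores are independent, which is what the 1.41 ≈ √2 factor implies.

```python
import math

Z_95 = 1.96  # two-sided 95% normal quantile, the multiplier used in the legend


def unpaired_threshold(se_a: float, z: float = Z_95) -> float:
    """Difference (in percentage points) needed when only SE(A) is known.

    Assumes two independent scores with equal standard error, so the SE of
    their difference is sqrt(2) * SE(A).
    """
    return z * math.sqrt(2.0) * se_a


def paired_threshold(se_a_minus_b: float, z: float = Z_95) -> float:
    """Difference (in percentage points) needed when the paired SE(A-B) is known."""
    return z * se_a_minus_b


if __name__ == "__main__":
    # (SE(A), SE(A-B)) copied from the table on this page.
    rows = {"swebench-pro": (1.6, 1.5), "swebench-verified": (2.2, 1.9)}
    for name, (se_a, se_ab) in rows.items():
        print(
            f"{name}: unpaired {unpaired_threshold(se_a):.2f}, "
            f"paired {paired_threshold(se_ab):.2f}"
        )
```

With the table's values, the paired threshold comes out smaller than the unpaired one for both benchmarks, which is the usual reason to compare models on the same questions.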

benchmark_id	size	models	SE(A)	SE_x(A)	SE(A-B)	SE_x(A-B)	no_solve	tau-	details
swebench-pro	731	150	1.6	1.3	1.5	0.25	39	18	models \| examples \| data \| more
swebench-verified	500	150	2.2	1.8	1.9	0.22	20	24	models \| examples \| data \| more
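
The no_solve and tau- columns can be approximated from a model-by-question correctness matrix. The following is a minimal sketch, not the page's actual pipeline: the function names and the toy matrix are hypothetical, overall model quality is assumed to be the number of questions solved, and only the sign of Kendall's tau is computed (concordant minus discordant pairs), which is the same for the common tau variants.

```python
from itertools import combinations


def no_solve_percent(correct):
    """Percent of questions that no model solved.

    correct[m][q] is 1 if model m solved question q, else 0.
    """
    n_questions = len(correct[0])
    unsolved = sum(
        1 for q in range(n_questions) if not any(row[q] for row in correct)
    )
    return 100.0 * unsolved / n_questions


def tau_sign(xs, ys):
    """Concordant minus discordant pair count; has the same sign as Kendall's tau."""
    s = 0
    for i, j in combinations(range(len(xs)), 2):
        prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
        s += (prod > 0) - (prod < 0)  # tied pairs contribute 0
    return s


def tau_negative_percent(correct):
    """Percent of questions whose outcome is negatively associated with model quality."""
    n_questions = len(correct[0])
    # Assumption: quality is the total number of questions a model solved.
    quality = [sum(row) for row in correct]
    negative = sum(
        1
        for q in range(n_questions)
        if tau_sign(quality, [row[q] for row in correct]) < 0
    )
    return 100.0 * negative / n_questions


if __name__ == "__main__":
    # Toy data: 4 models (rows) by 5 questions (columns).
    toy = [
        [1, 1, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [1, 0, 0, 0, 1],
        [0, 0, 0, 0, 1],
    ]
    print(no_solve_percent(toy), tau_negative_percent(toy))
```

In the toy matrix, the fourth question is solved by no model and the fifth is solved only by the two weakest models, so each column contributes one question out of five.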

Some datasets contain clustered questions (DS-1000, SAFIM, SQuAD, SWE-bench). A cluster correction might be needed, which would further widen the interval, but it is not applied here.
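
For reference, one common form of cluster correction sums residuals within each cluster before squaring them. The sketch below is an assumption about what such a correction could look like, not something this page computes: the function names, the cluster labels, and the omission of small-sample corrections are all illustrative choices.

```python
import math
from collections import defaultdict


def naive_se(scores):
    """Standard error of the mean, treating every question as independent."""
    n = len(scores)
    mean = sum(scores) / n
    return math.sqrt(sum((s - mean) ** 2 for s in scores)) / n


def clustered_se(scores, clusters):
    """Cluster-robust standard error of the mean.

    Residuals are summed within each cluster before squaring, so questions
    in the same cluster (for example, the same repository) may be correlated.
    """
    n = len(scores)
    mean = sum(scores) / n
    resid_by_cluster = defaultdict(float)
    for s, c in zip(scores, clusters):
        resid_by_cluster[c] += s - mean
    return math.sqrt(sum(r ** 2 for r in resid_by_cluster.values())) / n


if __name__ == "__main__":
    # Hypothetical per-question scores and cluster labels.
    scores = [1, 1, 1, 0, 0, 0, 1, 0]
    clusters = ["a", "a", "a", "b", "b", "b", "c", "d"]
    print(f"naive: {naive_se(scores):.4f}  clustered: {clustered_se(scores, clusters):.4f}")
```

When outcomes within a cluster move together, as in this toy data, the clustered estimate is larger than the naive one; when every question is its own cluster, the two coincide.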

Generated on 2026-02-14 06:29:09