Measuring all the noises of LLM evals

Details are linked and more summary stats are available on mouse over.

size: number of questions in the eval
models: the number of models used on this eval
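SE(A): the standard error (SE_x for data SE) in percent. A difference of 1.96 × 1.41 × SE(A) between two models is the conventional threshold for significance at the 95% level (1.41 ≈ √2, for an unpaired comparison)
SE(A-B): the paired standard error (SE_x for data SE) in percent. A difference of 1.96 × SE(A-B) is the conventional threshold for significance at the 95% level when both models are scored on the same questions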
corr(A,B): correlation of expected scores in percent
no_solve: percent of questions not solved by any model
tau-: percent of questions whose correctness is negatively correlated with overall model quality, as measured by Kendall's tau
details: aggregating by the models; by examples; the prediction heatmap; and raw data tables.
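
As a rough illustration of the no_solve and tau- columns, the sketch below computes both from a models × questions correctness matrix. The function names, the toy matrix, and the use of mean accuracy as "overall model quality" are assumptions for illustration; the page does not state the exact definitions it uses. Only the sign of Kendall's tau is needed, so the code counts concordant minus discordant pairs instead of computing a normalized coefficient.

```python
def no_solve_percent(correct):
    """Percent of questions that no model solved.

    correct[m][q] is 1 if model m solved question q, else 0 (assumed layout).
    """
    n_questions = len(correct[0])
    unsolved = sum(
        1 for q in range(n_questions) if not any(row[q] for row in correct)
    )
    return 100.0 * unsolved / n_questions


def kendall_s(xs, ys):
    """Concordant minus discordant pairs; has the same sign as Kendall's tau."""
    s = 0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
            s += (prod > 0) - (prod < 0)  # tied pairs contribute 0
    return s


def tau_negative_percent(correct):
    """Percent of questions whose per-model correctness trends against quality."""
    n_questions = len(correct[0])
    # Assumption: overall model quality is the model's mean accuracy.
    quality = [sum(row) / n_questions for row in correct]
    negative = sum(
        1
        for q in range(n_questions)
        if kendall_s(quality, [row[q] for row in correct]) < 0
    )
    return 100.0 * negative / n_questions


# Toy data: 4 models (rows) x 5 questions (columns), strongest model first.
toy = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
print(no_solve_percent(toy))      # question 4 is solved by nobody
print(tau_negative_percent(toy))  # question 5 is solved only by weaker models
```

In the toy matrix, one question of five is unsolved and one is solved only by the weaker models, so both statistics come out to 20 percent.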

benchmark_id	size	models	SE(A)	SE_x(A)	SE(A-B)	SE_x(A-B)	corr(A,B)	no_solve	tau-	details
aime2024_cot	30	50	7.6	5.7	8	3.7	77	3.3	0	models \| examples \| data \| raw
aime2025_cot	30	50	7	5.6	6.8	3.3	82	3.3	13	models \| examples \| data \| raw
human_eval_plus	164	49	3.6	2.7	4.3	2.4	61	6.1	1.2	models \| examples \| data \| raw
math500_cot	500	48	2	1.5	2.1	0.94	79	0.4	1.2	models \| examples \| data \| raw
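
The thresholds described in the legend can be worked out directly from the SE(A) and SE(A-B) columns. This is a minimal sketch: the function names and the `table` dictionary are hypothetical, the values are copied from the rows above, and √2 is used where the legend rounds to 1.41.

```python
import math

Z_95 = 1.96  # two-sided 95% normal quantile, as used in the legend


def unpaired_threshold(se_a):
    """Smallest difference (in percent) treated as significant when unpaired.

    Assumes both models have the same standard error and independent scores,
    so SE of the difference is sqrt(2) * SE(A).
    """
    return Z_95 * math.sqrt(2) * se_a


def paired_threshold(se_a_minus_b):
    """Smallest difference (in percent) treated as significant when paired."""
    return Z_95 * se_a_minus_b


# (SE(A), SE(A-B)) in percent, copied from the table above.
table = {
    "aime2024_cot": (7.6, 8.0),
    "aime2025_cot": (7.0, 6.8),
    "human_eval_plus": (3.6, 4.3),
    "math500_cot": (2.0, 2.1),
}

for name, (se_a, se_ab) in table.items():
    print(
        f"{name}: unpaired {unpaired_threshold(se_a):.1f}, "
        f"paired {paired_threshold(se_ab):.1f}"
    )
```

For math500_cot, for example, this gives roughly 5.5 points unpaired and 4.1 points paired, while the 30-question AIME sets need differences well above 10 points under either rule.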

Some datasets contain clustered questions (DS-1000, SAFIM, SQuAD, SWE-bench). A cluster correction, which would further widen the intervals, might be needed but is not applied here.
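
For reference, one common form of such a correction is the cluster-robust standard error of the mean, sketched below. This is not the method used on this page (no correction is applied); the function names and toy data are hypothetical, and no small-sample adjustment is included.

```python
import math
from collections import defaultdict


def naive_se(scores):
    """Standard error of the mean, treating every question as independent."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return math.sqrt(var / n)


def clustered_se(scores, cluster_ids):
    """Cluster-robust standard error of the mean.

    Residuals are summed within each cluster before squaring, so positively
    correlated questions in a cluster inflate the estimate.
    """
    n = len(scores)
    mean = sum(scores) / n
    resid_sum = defaultdict(float)
    for s, c in zip(scores, cluster_ids):
        resid_sum[c] += s - mean
    return math.sqrt(sum(r * r for r in resid_sum.values())) / n


# Toy data: 4 clusters of 3 questions, perfectly correlated within a cluster.
scores = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]
clusters = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]

print(naive_se(scores))
print(clustered_se(scores, clusters))
```

On this toy data the clustered estimate is larger than the naive one, which is the widening effect the note above refers to. With weak or negative within-cluster correlation the two estimates can be close.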