Measuring all the noises of LLM evals


Datasets: main, train_curve. Open models with vLLM: T=0.7, T=1, high K T=0.7.

Per-benchmark details are linked from each row; in the interactive version, more summary stats are shown on mouse-over.

Raw data: summary.csv

| benchmark_id | size | models | SE(A) | SE_x(A) | SE(A-B) | SE_x(A-B) | corr(A,B) | no_solve | tau- | details |
|---|---|---|---|---|---|---|---|---|---|---|
| aime2024_cot | 30 | 51 | 7.4 | 5.5 | 7.6 | 3 | 65 | 37 | 0 | models / examples / data / raw |
| aime2025_cot | 30 | 52 | 7 | 6 | 6.7 | 4.2 | 63 | 50 | 0 | models / examples / data / raw |
| ap_cot | 711 | 52 | 1.6 | 1.1 | 1.9 | 1.1 | 43 | 0 | 3.2 | models / examples / data / raw |
| bbh_cot | 6511 | 45 | 0.55 | 0.37 | 0.68 | 0.36 | 39 | 2 | 1.6 | models / examples / data / raw |
| cruxeval_input_cot | 800 | 52 | 1.6 | 1.1 | 1.9 | 0.85 | 49 | 2.4 | 1.4 | models / examples / data / raw |
| cruxeval_output_cot | 800 | 52 | 1.6 | 1.1 | 1.8 | 0.83 | 54 | 4.4 | 0.5 | models / examples / data / raw |
| ds1000 | 1000 | 39 | 1.2 | 0.93 | 1.3 | 0.66 | 58 | 32 | 0.3 | models / examples / data / raw |
| gmat_cot | 92 | 51 | 4.6 | 2.9 | 5.9 | 3.2 | 33 | 0 | 1.1 | models / examples / data / raw |
| gpqa_cot | 448 | 51 | 2.1 | 1.2 | 2.7 | 1.4 | 23 | 0 | 21 | models / examples / data / raw |
| gre_physics_cot | 75 | 51 | 5.2 | 3.4 | 6.7 | 3.9 | 26 | 0 | 4 | models / examples / data / raw |
| gsm8k_cot | 1319 | 52 | 1.1 | 0.76 | 1.2 | 0.61 | 49 | 0.076 | 0.91 | models / examples / data / raw |
| gsm8k_plus_cot | 10552 | 52 | 0.44 | 0.33 | 0.48 | 0.24 | 58 | 1.5 | 6.2 | models / examples / data / raw |
| human_eval | 164 | 51 | 3.5 | 2.4 | 4.4 | 2.2 | 41 | 1.2 | 0 | models / examples / data / raw |
| human_eval_plus | 164 | 51 | 3.6 | 2.5 | 4.3 | 2.3 | 44 | 6.7 | 0.61 | models / examples / data / raw |
| jeebench_chat_cot | 515 | 49 | 1.7 | 1.1 | 2.1 | 0.99 | 36 | 19 | 6 | models / examples / data / raw |
| leetcode | 180 | 51 | 3.1 | 2.3 | 3.5 | 1.8 | 47 | 39 | 0 | models / examples / data / raw |
| lsat_cot | 403 | 42 | 2.3 | 1.7 | 2.8 | 1.7 | 41 | 0.25 | 11 | models / examples / data / raw |
| math500_cot | 500 | 51 | 2 | 1.5 | 2.2 | 1 | 58 | 1.8 | 0.8 | models / examples / data / raw |
| math_cot | 5000 | 48 | 0.63 | 0.44 | 0.73 | 0.34 | 50 | 1.9 | 0.72 | models / examples / data / raw |
| mbpp | 500 | 35 | 2.1 | 1.7 | 2.2 | 1.2 | 64 | 3.8 | 2.2 | models / examples / data / raw |
| mgsm_cot | 2750 | 51 | 0.83 | 0.65 | 0.93 | 0.57 | 54 | 0.8 | 1.3 | models / examples / data / raw |
| mmlu_pro_cot | 12032 | 45 | 0.42 | 0.3 | 0.49 | 0.26 | 44 | 0.43 | 13 | models / examples / data / raw |
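For intuition about the SE columns: given per-question 0/1 correctness for two models, a single-score standard error and a paired-difference standard error can be estimated as the standard error of the mean of the scores and of the per-question differences, respectively. A minimal sketch under a simple i.i.d. question-sampling assumption (the hypothetical arrays below are illustrative, not real eval data, and this is not necessarily the exact estimator behind the table):

```python
import math

def se_mean(xs):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / (n - 1)  # unbiased sample variance
    return math.sqrt(var / n)

# Hypothetical per-question correctness for two models A and B.
a = [1, 0, 1, 1, 0, 1, 0, 1]
b = [1, 0, 0, 1, 0, 1, 1, 1]

se_a = se_mean(a)                                 # analogue of SE(A)
se_diff = se_mean([x - y for x, y in zip(a, b)])  # analogue of SE(A-B), paired
print(round(se_a, 3), round(se_diff, 3))
```

Because the difference is taken question by question, shared difficulty cancels when the two models' correctness is positively correlated, which is consistent with the table: the question-sampling component SE_x(A-B) is generally smaller than SE_x(A), and more so on benchmarks with higher corr(A,B).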

Notes: