Measuring all the noises of LLM evals


Datasets: main, train_curve. Open models with vLLM: T=0.7, T=1, high K T=0.7.

Details are linked, and more summary statistics are available on mouseover.

Raw data: summary.csv
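
If you want to explore the numbers programmatically, the raw data is the CSV above. A minimal loading sketch in Python, assuming the CSV's column names match the table headers below (an assumption; check the file's actual header row):

```python
# Sketch: load summary.csv and rank benchmarks by single-model noise.
# Column names ("SE(A)", "SE(A-B)", ...) are assumed to match the
# rendered table headers; adjust to the file's real header if they differ.
import pandas as pd

df = pd.read_csv("summary.csv")

# Benchmarks with the largest single-model standard error first.
cols = ["benchmark_id", "size", "SE(A)", "SE(A-B)"]
print(df.sort_values("SE(A)", ascending=False)[cols].to_string(index=False))
```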

| benchmark_id | size | models | SE(A) | SE_x(A) | SE(A-B) | SE_x(A-B) | corr(A,B) | no_solve | tau- | details |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---|
| aime2024_cot | 30 | 52 | 7.3 | 5.7 | 8.1 | 4.7 | 61 | 20 | 0 | models \| examples \| data \| raw |
| aime2025_cot | 30 | 52 | 7.2 | 5.9 | 6.7 | 3.4 | 80 | 27 | 3.3 | models \| examples \| data \| raw |
| ap_cot | 711 | 52 | 1.6 | 1.2 | 1.9 | 1.1 | 51 | 0 | 3.1 | models \| examples \| data \| raw |
| bbh_cot | 6511 | 45 | 0.56 | 0.41 | 0.68 | 0.4 | 48 | 1.2 | 2.1 | models \| examples \| data \| raw |
| cruxeval_input_cot | 800 | 52 | 1.6 | 1.2 | 1.9 | 1 | 60 | 0.62 | 1.5 | models \| examples \| data \| raw |
| cruxeval_output_cot | 800 | 52 | 1.6 | 1.2 | 1.8 | 0.93 | 65 | 2.5 | 1.1 | models \| examples \| data \| raw |
| ds1000 | 1000 | 39 | 1.3 | 1 | 1.3 | 0.8 | 65 | 27 | 0.9 | models \| examples \| data \| raw |
| gmat_cot | 92 | 51 | 4.6 | 3.1 | 5.8 | 3.4 | 43 | 0 | 1.1 | models \| examples \| data \| raw |
| gpqa_cot | 448 | 51 | 2.1 | 1.3 | 2.8 | 1.6 | 31 | 0 | 21 | models \| examples \| data \| raw |
| gre_physics_cot | 75 | 51 | 5.2 | 3.7 | 6.6 | 4.2 | 32 | 0 | 5.3 | models \| examples \| data \| raw |
| gsm8k_cot | 1319 | 52 | 1.1 | 0.78 | 1.1 | 0.64 | 61 | 0.15 | 0.99 | models \| examples \| data \| raw |
| gsm8k_plus_cot | 10552 | 52 | 0.44 | 0.34 | 0.44 | 0.25 | 71 | 0.78 | 5.6 | models \| examples \| data \| raw |
| human_eval | 164 | 51 | 3.5 | 2.6 | 4.4 | 2.6 | 49 | 0.61 | 0 | models \| examples \| data \| raw |
| human_eval_plus | 164 | 51 | 3.6 | 2.8 | 4.3 | 2.6 | 53 | 6.1 | 1.2 | models \| examples \| data \| raw |
| jeebench_chat_cot | 515 | 49 | 1.6 | 1.1 | 2 | 1.2 | 40 | 14 | 5.2 | models \| examples \| data \| raw |
| leetcode | 180 | 51 | 3 | 2.4 | 3.4 | 2.2 | 54 | 29 | 0 | models \| examples \| data \| raw |
| lsat_cot | 403 | 45 | 2.3 | 1.8 | 2.7 | 1.8 | 48 | 0.25 | 10 | models \| examples \| data \| raw |
| math500_cot | 500 | 51 | 2 | 1.5 | 2.1 | 1.1 | 69 | 0.6 | 1 | models \| examples \| data \| raw |
| math_cot | 5000 | 49 | 0.63 | 0.47 | 0.73 | 0.37 | 66 | 1.1 | 0.64 | models \| examples \| data \| raw |
| mbpp | 500 | 35 | 2.1 | 1.7 | 2.1 | 1.4 | 69 | 3 | 2.4 | models \| examples \| data \| raw |
| mgsm_cot | 2750 | 52 | 0.83 | 0.66 | 0.9 | 0.53 | 64 | 0.36 | 1.6 | models \| examples \| data \| raw |
| mmlu_pro_cot | 12032 | 45 | 0.42 | 0.31 | 0.49 | 0.29 | 54 | 0.066 | 13 | models \| examples \| data \| raw |
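
To make the SE columns concrete, here is a minimal sketch of how per-benchmark standard errors like SE(A) and SE(A-B) can be estimated from per-question 0/1 scores. The key point it illustrates is that A and B are scored on the same questions, so the paired SE(A-B) can be smaller than the naive sqrt(SE(A)² + SE(B)²) when corr(A,B) is positive. The synthetic scores below are illustrative only; the paper's exact estimators may differ.

```python
# Sketch: question-level standard errors from paired 0/1 scores,
# reported in percentage points as in the table above.
import numpy as np

rng = np.random.default_rng(0)
n = 500                                  # number of questions ("size")
a = rng.random(n) < 0.70                 # model A per-question correctness
b = a ^ (rng.random(n) < 0.15)           # model B: A with ~15% of answers flipped

def se_mean(x):
    """Standard error of the mean of 0/1 scores, in percentage points."""
    x = x.astype(float)
    return 100 * x.std(ddof=1) / np.sqrt(len(x))

d = a.astype(float) - b.astype(float)    # paired per-question differences
print(f"SE(A)     = {se_mean(a):.2f} pp")
print(f"SE(A-B)   = {se_mean(d):.2f} pp")                 # paired, not naive
print(f"corr(A,B) = {100 * np.corrcoef(a, b)[0, 1]:.0f}")
```

Run with a higher flip rate (lower corr(A,B)) and SE(A-B) grows toward the naive unpaired value, which is the pattern visible across the benchmarks above.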

Notes: