bbh_cot: results by model

SE predicted by accuracy

[Figure] Typical standard error between pairs of models on this dataset, plotted as a function of absolute accuracy.
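The formula behind the plot isn't reproduced here, but the SE(A) column in the table below is consistent with the binomial standard error sqrt(A(1-A)/N) with N ≈ 6,511 questions (the usual BBH example count, back-solved here from the table rather than stated on the page). A minimal sketch under that assumption:

```python
import math

N_QUESTIONS = 6511  # assumption: BBH example count, inferred from the SE(A) column


def binomial_se(accuracy_pct: float, n: int = N_QUESTIONS) -> float:
    """Standard error of an accuracy estimate, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


print(round(binomial_se(82.1), 2))  # 0.48, matching SE(A) for qwen3-14b below
```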

[Figure] CDF of question-level accuracy.
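Only the panel title survives for this figure. As a sketch of what such a curve typically computes, assuming a hypothetical 0/1 grade matrix of shape (models, questions):

```python
import numpy as np


def question_level_cdf(correct: np.ndarray):
    """Empirical CDF of per-question accuracy.

    `correct` is a hypothetical (n_models, n_questions) matrix of 0/1
    grades; each question's accuracy is its mean over models.
    Returns x (sorted accuracies) and y (cumulative fractions).
    """
    per_question_acc = correct.mean(axis=0)
    xs = np.sort(per_question_acc)
    ys = np.arange(1, xs.size + 1) / xs.size
    return xs, ys


# Placeholder data: 45 models x 6511 questions of random grades.
rng = np.random.default_rng(0)
xs, ys = question_level_cdf(rng.integers(0, 2, size=(45, 6511)))
```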

Results table by model

All values except count are in percentage points.

| model | pass@1 | win rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---:|---:|---:|---:|---:|---:|
| qwen3-14b | 82.1 | 41.4 | 3 | 0.48 | 0.41 | 0.24 |
| google_gemma_3_12b_it | 81.5 | 41.3 | 3 | 0.48 | 0.39 | 0.29 |
| llama-3.1-70B-instruct | 81.2 | 41.4 | 4 | 0.48 | 0.48 | 0 |
| qwen3-4b | 75 | 36.8 | 4 | 0.54 | 0.47 | 0.26 |
| qwen2.5-coder-32b-instruct | 74.8 | 36.7 | 2 | 0.54 | 0.39 | 0.37 |
| qwen3-32b | 73.3 | 36.2 | 2 | 0.55 | 0.41 | 0.36 |
| qwen3-8b | 72.6 | 35.2 | 3 | 0.55 | 0.47 | 0.29 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 72.3 | 35 | 2 | 0.55 | 0.39 | 0.39 |
| qwen2-72b-instruct | 71.8 | 35.2 | 2 | 0.56 | 0.39 | 0.4 |
| qwen2-math-72b-instruct | 68.6 | 33.2 | 2 | 0.58 | 0.38 | 0.43 |
| qwen1.5-72b-chat | 65.4 | 31.3 | 2 | 0.59 | 0.42 | 0.41 |
| google_gemma_3_4b_it | 64.4 | 31.2 | 5 | 0.59 | 0.47 | 0.36 |
| qwen2.5-coder-14b-instruct | 63.7 | 30.3 | 3 | 0.6 | 0.38 | 0.46 |
| qwen1.5-32b-chat | 60.9 | 28.3 | 2 | 0.6 | 0.41 | 0.45 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 58.3 | 26.9 | 3 | 0.61 | 0.45 | 0.42 |
| llama-3.1-8B-instruct | 54.7 | 25.1 | 7 | 0.62 | 0.62 | 0 |
| mistralai_mathstral_7b_v0.1 | 53 | 24.1 | 3 | 0.62 | 0.38 | 0.49 |
| qwen2-math-7b-instruct | 51.9 | 23.5 | 4 | 0.62 | 0.43 | 0.44 |
| llama-3.2-3B-instruct | 49.9 | 22.3 | 10 | 0.62 | 0.62 | 0 |
| qwen3-1.7b | 49 | 21.5 | 4 | 0.62 | 0.52 | 0.34 |
| mistralai_ministral_8b_instruct_2410 | 47.2 | 21.1 | 3 | 0.62 | 0.35 | 0.51 |
| qwen2.5-coder-3b-instruct | 46.7 | 20.8 | 4 | 0.62 | 0.37 | 0.49 |
| qwen2.5-coder-7b-instruct | 46.5 | 20.7 | 3 | 0.62 | 0.35 | 0.51 |
| mistralai_mistral_7b_instruct_v0.3 | 45.1 | 19.5 | 3 | 0.62 | 0.42 | 0.45 |
| deepseek_v2_lite_chat | 41.4 | 18.4 | 2 | 0.61 | 0.39 | 0.47 |
| qwen1.5-14b-chat | 36.8 | 16.1 | 3 | 0.6 | 0.34 | 0.49 |
| mistralai_mistral_7b_instruct_v0.1 | 35.8 | 15.3 | 3 | 0.59 | 0.36 | 0.47 |
| mistralai_mistral_7b_instruct_v0.2 | 35.3 | 14.5 | 3 | 0.59 | 0.41 | 0.43 |
| qwen2-math-1.5b-instruct | 32.7 | 13.9 | 3 | 0.58 | 0.38 | 0.44 |
| qwen3-0.6b | 30.6 | 13.2 | 4 | 0.57 | 0.37 | 0.43 |
| deepseek_r1_distill_llama_70b | 30.4 | 12.9 | 2 | 0.57 | 0.33 | 0.47 |
| qwen2.5-coder-1.5b-instruct | 28.7 | 12.5 | 4 | 0.56 | 0.29 | 0.48 |
| llama-3.2-1B-instruct | 27.5 | 12.2 | 13 | 0.55 | 0.55 | 0 |
| qwen2-7b-instruct | 27.3 | 11.4 | 3 | 0.55 | 0.3 | 0.46 |
| deepseek_r1_distill_qwen_7b | 25.6 | 10.4 | 3 | 0.54 | 0.3 | 0.45 |
| qwen1.5-7b-chat | 22.3 | 9.25 | 3 | 0.52 | 0.26 | 0.45 |
| deepseek_r1_distill_llama_8b | 22.1 | 8.96 | 4 | 0.51 | 0.25 | 0.45 |
| qwen2.5-coder-0.5b-instruct | 21.7 | 9.77 | 4 | 0.51 | 0.24 | 0.45 |
| deepseek_r1_distill_qwen_14b | 19.2 | 7.61 | 3 | 0.49 | 0.26 | 0.41 |
| deepseek_r1_distill_qwen_32b | 17.3 | 6.94 | 2 | 0.47 | 0.22 | 0.41 |
| qwen2-1.5b-instruct | 11.6 | 5.08 | 4 | 0.4 | 0.15 | 0.37 |
| qwen1.5-1.8b-chat | 11.3 | 4.66 | 3 | 0.39 | 0.17 | 0.35 |
| qwen1.5-0.5b-chat | 10.6 | 4.57 | 4 | 0.38 | 0.16 | 0.35 |
| qwen2-0.5b-instruct | 9.41 | 4.28 | 4 | 0.36 | 0.13 | 0.34 |
| deepseek_r1_distill_qwen_1.5b | 8.6 | 3.56 | 4 | 0.35 | 0.13 | 0.32 |
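One regularity the page doesn't state explicitly: the SE columns combine in quadrature, SE(A)² ≈ SE_x(A)² + SE_pred(A)², which suggests the total standard error is split into a question-level component (SE_x) and a predicted residual component (SE_pred). A quick check against a few rows of the table:

```python
import math

# (SE(A), SE_x(A), SE_pred(A)) triples copied from the table above
rows = {
    "qwen3-14b": (0.48, 0.41, 0.24),
    "llama-3.1-70B-instruct": (0.48, 0.48, 0.0),
    "deepseek_r1_distill_qwen_1.5b": (0.35, 0.13, 0.32),
}

for name, (se, se_x, se_pred) in rows.items():
    recombined = math.hypot(se_x, se_pred)  # sqrt(se_x**2 + se_pred**2)
    print(f"{name}: SE(A)={se}, recombined={recombined:.2f}")
# recombined matches SE(A) to two decimals, up to rounding of the inputs
```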