bbh_cot: by model

SE predicted by accuracy

The typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
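
For a single model at accuracy p over N questions, the predicted standard error is the binomial one, sqrt(p(1-p)/N). Below is a minimal sketch, assuming N = 6,511 (the full BBH set); this choice of N reproduces the SE(A) column of the table further down. The unpaired pairwise helper is only an illustration, not necessarily the paired estimator the figure uses.

```python
import math

N = 6511  # assumption: the full BBH question set

def se_accuracy(p: float, n: int = N) -> float:
    """Predicted standard error of an accuracy p, in percentage points."""
    return 100 * math.sqrt(p * (1 - p) / n)

def se_pair(p_a: float, p_b: float, n: int = N) -> float:
    """Unpaired SE of the accuracy gap between two models (illustrative)."""
    return math.hypot(se_accuracy(p_a, n), se_accuracy(p_b, n))

print(round(se_accuracy(0.834), 2))     # 0.46, matching llama-3.1-70B-instruct
print(round(se_pair(0.834, 0.825), 2))  # ~0.66 for the top two models
```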

CDF of question-level accuracy
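
A curve like this can be rebuilt from per-question pass rates. A sketch with placeholder data follows; `per_question_acc` is hypothetical, so substitute the real per-question accuracies.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical per-question pass rates: the fraction of correct samples
# on each of the 6511 questions (a stand-in for the real data).
per_question_acc = np.random.default_rng(0).beta(2, 2, size=6511)

xs = np.sort(per_question_acc)
cdf = np.arange(1, xs.size + 1) / xs.size  # empirical CDF

plt.step(xs, cdf, where="post")
plt.xlabel("question-level accuracy")
plt.ylabel("fraction of questions")
plt.title("CDF of question-level accuracy")
plt.show()
```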

Results table by model

pass1 and win_rate are percentages; the three SE columns are standard errors on pass1, in percentage points. SE(A) is the total standard error, and SE_x(A) and SE_pred(A) are its two components: in every row, SE(A)² = SE_x(A)² + SE_pred(A)².

model pass1 win_rate count SE(A) SE_x(A) SE_pred(A)
llama-3.1-70B-instruct 83.4 37 12 0.46 0.46 0
qwen3-14b 82.5 36.2 11 0.47 0.42 0.22
google_gemma_3_12b_it 81.8 36 12 0.48 0.4 0.26
qwen2-72b-instruct 79 34.2 10 0.5 0.4 0.31
qwen2-math-72b-instruct 78.5 34 10 0.51 0.4 0.32
qwen2.5-coder-32b-instruct 78.3 33.9 10 0.51 0.39 0.33
mistralai_mixtral_8x22b_instruct_v0.1 77.4 33 9 0.52 0.4 0.32
qwen3-32b 76.2 33.1 9 0.53 0.42 0.32
qwen3-4b 75.2 31.9 12 0.54 0.47 0.25
qwen3-8b 72.9 30.6 11 0.55 0.48 0.27
qwen2.5-coder-14b-instruct 71.4 29.9 11 0.56 0.41 0.38
qwen1.5-72b-chat 68.3 28.2 10 0.58 0.45 0.37
llama-3.1-8B-instruct 66.3 27.1 15 0.59 0.59 0
qwen1.5-32b-chat 66.2 26.8 9 0.59 0.44 0.39
google_gemma_3_4b_it 64.4 27.1 11 0.59 0.49 0.34
mistralai_mathstral_7b_v0.1 62.2 24.8 12 0.6 0.42 0.43
mistralai_mixtral_8x7b_instruct_v0.1 61.9 24.6 11 0.6 0.48 0.37
mistralai_ministral_8b_instruct_2410 60.2 23.8 11 0.61 0.41 0.45
qwen2.5-coder-7b-instruct 59.9 23.6 12 0.61 0.41 0.45
qwen2-math-7b-instruct 56.9 22.1 11 0.61 0.47 0.39
llama-3.2-3B-instruct 56.8 22 18 0.61 0.61 0
qwen2.5-coder-3b-instruct 54.2 20.9 12 0.62 0.43 0.44
mistralai_mistral_7b_instruct_v0.3 51.3 19.1 12 0.62 0.47 0.41
qwen3-1.7b 49.3 18.3 12 0.62 0.53 0.32
deepseek_v2_lite_chat 46.1 17.5 9 0.62 0.44 0.43
qwen1.5-14b-chat 42.3 15.9 11 0.61 0.39 0.47
mistralai_mistral_7b_instruct_v0.1 42.2 15.4 12 0.61 0.42 0.45
qwen2-math-1.5b-instruct 41.5 15.5 12 0.61 0.44 0.42
mistralai_mistral_7b_instruct_v0.2 40.5 14.3 12 0.61 0.45 0.41
qwen2-7b-instruct 38.5 13.9 12 0.6 0.38 0.46
qwen2.5-coder-1.5b-instruct 35.9 13.6 12 0.59 0.36 0.47
deepseek_r1_distill_llama_70b 35.6 12.9 9 0.59 0.37 0.46
llama-3.2-1B-instruct 34.9 13.3 21 0.59 0.59 0
deepseek_r1_distill_qwen_7b 33.1 11.5 12 0.58 0.37 0.45
qwen3-0.6b 33.1 12.2 12 0.58 0.42 0.4
deepseek_r1_distill_llama_8b 29.1 10.1 12 0.56 0.31 0.47
qwen2.5-coder-0.5b-instruct 27 10.5 12 0.55 0.34 0.43
qwen1.5-7b-chat 26.6 9.37 11 0.55 0.31 0.45
deepseek_r1_distill_qwen_14b 23.4 7.79 12 0.52 0.3 0.43
qwen2-1.5b-instruct 22.4 8.29 12 0.52 0.25 0.45
qwen1.5-0.5b-chat 21.7 8.47 12 0.51 0.29 0.42
deepseek_r1_distill_qwen_32b 21.5 7.26 7 0.51 0.26 0.44
qwen1.5-1.8b-chat 19.8 7.08 11 0.49 0.29 0.4
qwen2-0.5b-instruct 18.2 7.01 12 0.48 0.25 0.41
deepseek_r1_distill_qwen_1.5b 14 4.81 12 0.43 0.21 0.38
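
The two components recombine in quadrature to the total, SE(A)² = SE_x(A)² + SE_pred(A)², up to the rounding of the displayed values. A quick check on a few rows of the table:

```python
import math

# (SE(A), SE_x(A), SE_pred(A)) triples copied from the table above.
rows = {
    "qwen3-14b":                     (0.47, 0.42, 0.22),
    "qwen2.5-coder-14b-instruct":    (0.56, 0.41, 0.38),
    "deepseek_r1_distill_qwen_1.5b": (0.43, 0.21, 0.38),
}

for name, (se, se_x, se_pred) in rows.items():
    recombined = math.hypot(se_x, se_pred)  # sqrt(SE_x^2 + SE_pred^2)
    print(f"{name}: SE(A)={se:.2f} vs recombined {recombined:.2f}")
```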