bbh_cot: results by model

SE predicted by accuracy

[Figure] Typical standard error between pairs of models on this dataset, plotted as a function of absolute accuracy.
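The formula behind the plot isn't reproduced here, but the SE(A) column in the table below is consistent with the binomial standard error sqrt(A(1-A)/N) with N ≈ 6,511 questions (the usual BBH example count, back-solved here from the table rather than stated on the page). A minimal sketch under that assumption:

```python
import math

N_QUESTIONS = 6511  # assumption: BBH example count, inferred from the SE(A) column


def binomial_se(accuracy_pct: float, n: int = N_QUESTIONS) -> float:
    """Standard error of an accuracy estimate, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


print(round(binomial_se(82.1), 2))  # 0.48, matching SE(A) for qwen3-14b below
```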

[Figure] CDF of question-level accuracy.
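Only the panel title survives for this figure. As a sketch of what such a curve typically computes, assuming a hypothetical 0/1 grade matrix of shape (models, questions):

```python
import numpy as np


def question_level_cdf(correct: np.ndarray):
    """Empirical CDF of per-question accuracy.

    `correct` is a hypothetical (n_models, n_questions) matrix of 0/1
    grades; each question's accuracy is its mean over models.
    Returns x (sorted accuracies) and y (cumulative fractions).
    """
    per_question_acc = correct.mean(axis=0)
    xs = np.sort(per_question_acc)
    ys = np.arange(1, xs.size + 1) / xs.size
    return xs, ys


# Placeholder data: 45 models x 6511 questions of random grades.
rng = np.random.default_rng(0)
xs, ys = question_level_cdf(rng.integers(0, 2, size=(45, 6511)))
```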

Results table by model

All values except count are in percentage points.

| model | pass@1 | win rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---:|---:|---:|---:|---:|---:|
| qwen3-14b | 82.1 | 41.4 | 3 | 0.48 | 0.41 | 0.24 |
| google_gemma_3_12b_it | 81.5 | 41.3 | 3 | 0.48 | 0.39 | 0.29 |
| llama-3.1-70B-instruct | 81.2 | 41.4 | 4 | 0.48 | 0.48 | 0 |
| qwen3-4b | 75 | 36.8 | 4 | 0.54 | 0.47 | 0.26 |
| qwen2.5-coder-32b-instruct | 74.8 | 36.7 | 2 | 0.54 | 0.39 | 0.37 |
| qwen3-32b | 73.3 | 36.2 | 2 | 0.55 | 0.41 | 0.36 |
| qwen3-8b | 72.6 | 35.2 | 3 | 0.55 | 0.47 | 0.29 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 72.3 | 35 | 2 | 0.55 | 0.39 | 0.39 |
| qwen2-72b-instruct | 71.8 | 35.2 | 2 | 0.56 | 0.39 | 0.4 |
| qwen2-math-72b-instruct | 68.6 | 33.2 | 2 | 0.58 | 0.38 | 0.43 |
| qwen1.5-72b-chat | 65.4 | 31.3 | 2 | 0.59 | 0.42 | 0.41 |
| google_gemma_3_4b_it | 64.4 | 31.2 | 5 | 0.59 | 0.47 | 0.36 |
| qwen2.5-coder-14b-instruct | 63.7 | 30.3 | 3 | 0.6 | 0.38 | 0.46 |
| qwen1.5-32b-chat | 60.9 | 28.3 | 2 | 0.6 | 0.41 | 0.45 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 58.3 | 26.9 | 3 | 0.61 | 0.45 | 0.42 |
| llama-3.1-8B-instruct | 54.7 | 25.1 | 7 | 0.62 | 0.62 | 0 |
| mistralai_mathstral_7b_v0.1 | 53 | 24.1 | 3 | 0.62 | 0.38 | 0.49 |
| qwen2-math-7b-instruct | 51.9 | 23.5 | 4 | 0.62 | 0.43 | 0.44 |
| llama-3.2-3B-instruct | 49.9 | 22.3 | 10 | 0.62 | 0.62 | 0 |
| qwen3-1.7b | 49 | 21.5 | 4 | 0.62 | 0.52 | 0.34 |
| mistralai_ministral_8b_instruct_2410 | 47.2 | 21.1 | 3 | 0.62 | 0.35 | 0.51 |
| qwen2.5-coder-3b-instruct | 46.7 | 20.8 | 4 | 0.62 | 0.37 | 0.49 |
| qwen2.5-coder-7b-instruct | 46.5 | 20.7 | 3 | 0.62 | 0.35 | 0.51 |
| mistralai_mistral_7b_instruct_v0.3 | 45.1 | 19.5 | 3 | 0.62 | 0.42 | 0.45 |
| deepseek_v2_lite_chat | 41.4 | 18.4 | 2 | 0.61 | 0.39 | 0.47 |
| qwen1.5-14b-chat | 36.8 | 16.1 | 3 | 0.6 | 0.34 | 0.49 |
| mistralai_mistral_7b_instruct_v0.1 | 35.8 | 15.3 | 3 | 0.59 | 0.36 | 0.47 |
| mistralai_mistral_7b_instruct_v0.2 | 35.3 | 14.5 | 3 | 0.59 | 0.41 | 0.43 |
| qwen2-math-1.5b-instruct | 32.7 | 13.9 | 3 | 0.58 | 0.38 | 0.44 |
| qwen3-0.6b | 30.6 | 13.2 | 4 | 0.57 | 0.37 | 0.43 |
| deepseek_r1_distill_llama_70b | 30.4 | 12.9 | 2 | 0.57 | 0.33 | 0.47 |
| qwen2.5-coder-1.5b-instruct | 28.7 | 12.5 | 4 | 0.56 | 0.29 | 0.48 |
| llama-3.2-1B-instruct | 27.5 | 12.2 | 13 | 0.55 | 0.55 | 0 |
| qwen2-7b-instruct | 27.3 | 11.4 | 3 | 0.55 | 0.3 | 0.46 |
| deepseek_r1_distill_qwen_7b | 25.6 | 10.4 | 3 | 0.54 | 0.3 | 0.45 |
| qwen1.5-7b-chat | 22.3 | 9.25 | 3 | 0.52 | 0.26 | 0.45 |
| deepseek_r1_distill_llama_8b | 22.1 | 8.96 | 4 | 0.51 | 0.25 | 0.45 |
| qwen2.5-coder-0.5b-instruct | 21.7 | 9.77 | 4 | 0.51 | 0.24 | 0.45 |
| deepseek_r1_distill_qwen_14b | 19.2 | 7.61 | 3 | 0.49 | 0.26 | 0.41 |
| deepseek_r1_distill_qwen_32b | 17.3 | 6.94 | 2 | 0.47 | 0.22 | 0.41 |
| qwen2-1.5b-instruct | 11.6 | 5.08 | 4 | 0.4 | 0.15 | 0.37 |
| qwen1.5-1.8b-chat | 11.3 | 4.66 | 3 | 0.39 | 0.17 | 0.35 |
| qwen1.5-0.5b-chat | 10.6 | 4.57 | 4 | 0.38 | 0.16 | 0.35 |
| qwen2-0.5b-instruct | 9.41 | 4.28 | 4 | 0.36 | 0.13 | 0.34 |
| deepseek_r1_distill_qwen_1.5b | 8.6 | 3.56 | 4 | 0.35 | 0.13 | 0.32 |
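One regularity the page doesn't state explicitly: the SE columns combine in quadrature, SE(A)² ≈ SE_x(A)² + SE_pred(A)², which suggests the total standard error is split into a question-level component (SE_x) and a predicted residual component (SE_pred). A quick check against a few rows of the table:

```python
import math

# (SE(A), SE_x(A), SE_pred(A)) triples copied from the table above
rows = {
    "qwen3-14b": (0.48, 0.41, 0.24),
    "llama-3.1-70B-instruct": (0.48, 0.48, 0.0),
    "deepseek_r1_distill_qwen_1.5b": (0.35, 0.13, 0.32),
}

for name, (se, se_x, se_pred) in rows.items():
    recombined = math.hypot(se_x, se_pred)  # sqrt(se_x**2 + se_pred**2)
    print(f"{name}: SE(A)={se}, recombined={recombined:.2f}")
# recombined matches SE(A) to two decimals, up to rounding of the inputs
```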