mmlu_pro_cot: by models

SE predicted by accuracy

Typical standard error of the accuracy difference between pairs of models on this dataset, as a function of absolute accuracy.
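
If each question is scored as an independent Bernoulli trial, an accuracy estimate A has standard error sqrt(A(1−A)/n), which traces exactly this kind of curve: largest near 50% accuracy and shrinking toward the extremes. A minimal sketch of that relationship, assuming the SE columns below are in percentage points and n is roughly the 12,000-question MMLU-Pro test set (both are assumptions, not stated on this page); the paired-difference SE would additionally depend on the covariance between the two models' per-question scores:

```python
import numpy as np

def binomial_se_pp(acc_pct: float, n_questions: int = 12000) -> float:
    """Standard error of an accuracy estimate, in percentage points.

    Treats each question as an independent Bernoulli trial; n_questions
    defaults to the approximate MMLU-Pro test set size (an assumption).
    """
    p = acc_pct / 100.0
    return 100.0 * np.sqrt(p * (1.0 - p) / n_questions)

# The curve peaks at 50% accuracy and shrinks toward 0% and 100%.
print(f"{binomial_se_pp(69.0):.2f}")  # ~0.42 pp, cf. qwen3-32b's SE(A) below
print(f"{binomial_se_pp(5.49):.2f}")  # ~0.21 pp, cf. qwen1.5-0.5b-chat
```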

CDF of question-level accuracy
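
A curve like this can be rebuilt from per-question pass rates: score every question on each run, average over runs, sort, and plot rank/n. A sketch with placeholder data (the `scores` array and its shape are hypothetical, not this site's data format):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical input: `scores` has shape (n_runs, n_questions), with each
# entry 1 if that run answered that question correctly. Question-level
# accuracy is the mean over runs; the CDF shows what fraction of questions
# fall at or below each accuracy level.
rng = np.random.default_rng(0)
scores = rng.random((3, 12000)) < 0.6   # placeholder Bernoulli data
q_acc = scores.mean(axis=0)             # per-question accuracy
xs = np.sort(q_acc)
cdf = np.arange(1, len(xs) + 1) / len(xs)
plt.step(xs, cdf, where="post")
plt.xlabel("question-level accuracy")
plt.ylabel("fraction of questions")
plt.show()
```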

Results table by model

| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| qwen3-32b | 69 | 39.3 | 2 | 0.42 | 0.34 | 0.24 |
| qwen3-14b | 67.2 | 37.8 | 2 | 0.43 | 0.37 | 0.22 |
| qwen3-8b | 62.5 | 34.2 | 2 | 0.44 | 0.38 | 0.23 |
| llama-3.1-70B-instruct | 60.3 | 33.2 | 3 | 0.45 | 0.45 | 0 |
| qwen2-72b-instruct | 60.1 | 32.7 | 1 | 0.45 | NaN | NaN |
| qwen2.5-coder-32b-instruct | 59.3 | 31.9 | 1 | 0.45 | NaN | NaN |
| google_gemma_3_12b_it | 58.4 | 31.2 | 3 | 0.45 | 0.38 | 0.24 |
| qwen3-4b | 57.7 | 31 | 3 | 0.45 | 0.39 | 0.23 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 47.1 | 23.9 | 2 | 0.46 | 0.31 | 0.33 |
| deepseek_r1_distill_qwen_32b | 46.6 | 23.2 | 1 | 0.45 | NaN | NaN |
| qwen1.5-72b-chat | 46.5 | 23.7 | 1 | 0.45 | NaN | NaN |
| deepseek_r1_distill_qwen_14b | 44.7 | 22 | 3 | 0.45 | 0.34 | 0.29 |
| qwen2-math-72b-instruct | 44.4 | 22.6 | 1 | 0.45 | NaN | NaN |
| qwen2.5-coder-14b-instruct | 43.3 | 21.5 | 2 | 0.45 | 0.3 | 0.34 |
| qwen1.5-32b-chat | 43.3 | 21.5 | 2 | 0.45 | 0.31 | 0.33 |
| qwen3-1.7b | 42.4 | 21.3 | 3 | 0.45 | 0.36 | 0.27 |
| google_gemma_3_4b_it | 41.5 | 20.2 | 4 | 0.45 | 0.37 | 0.26 |
| qwen2-7b-instruct | 41 | 20 | 3 | 0.45 | 0.32 | 0.32 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 38.4 | 19 | 2 | 0.44 | 0.3 | 0.33 |
| llama-3.1-8B-instruct | 37.5 | 18.1 | 7 | 0.44 | 0.44 | 0 |
| qwen1.5-14b-chat | 36 | 17.1 | 2 | 0.44 | 0.31 | 0.3 |
| mistralai_ministral_8b_instruct_2410 | 32.8 | 15.4 | 3 | 0.43 | 0.27 | 0.33 |
| deepseek_r1_distill_llama_8b | 32.4 | 14.5 | 3 | 0.43 | 0.3 | 0.3 |
| qwen2.5-coder-7b-instruct | 32.1 | 15.2 | 3 | 0.43 | 0.26 | 0.33 |
| deepseek_r1_distill_qwen_7b | 31.5 | 14.5 | 3 | 0.42 | 0.3 | 0.3 |
| mistralai_mistral_7b_instruct_v0.3 | 31.3 | 14.7 | 3 | 0.42 | 0.29 | 0.31 |
| mistralai_mathstral_7b_v0.1 | 29.7 | 14 | 3 | 0.42 | 0.24 | 0.34 |
| llama-3.2-3B-instruct | 29.2 | 13.8 | 10 | 0.41 | 0.41 | 0 |
| qwen2-math-7b-instruct | 29 | 14.3 | 3 | 0.41 | 0.25 | 0.33 |
| mistralai_mistral_7b_instruct_v0.2 | 27.9 | 12.9 | 3 | 0.41 | 0.29 | 0.29 |
| google_codegemma_1.1_7b_it | 27 | 12.7 | 1 | 0.4 | NaN | NaN |
| deepseek_v2_lite_chat | 26.1 | 12 | 2 | 0.4 | 0.25 | 0.31 |
| qwen2.5-coder-3b-instruct | 25.9 | 12 | 3 | 0.4 | 0.24 | 0.32 |
| qwen1.5-7b-chat | 23.1 | 10.5 | 3 | 0.38 | 0.23 | 0.31 |
| qwen3-0.6b | 23 | 11.6 | 4 | 0.38 | 0.26 | 0.28 |
| mistralai_mistral_7b_instruct_v0.1 | 19 | 8.87 | 3 | 0.36 | 0.2 | 0.3 |
| qwen2.5-coder-1.5b-instruct | 16.9 | 8.11 | 3 | 0.34 | 0.17 | 0.3 |
| qwen2-math-1.5b-instruct | 16.6 | 8.31 | 3 | 0.34 | 0.17 | 0.29 |
| llama-3.2-1B-instruct | 16.5 | 8.09 | 12 | 0.34 | 0.34 | 0 |
| deepseek_r1_distill_qwen_1.5b | 15.9 | 7.17 | 3 | 0.33 | 0.18 | 0.28 |
| qwen2-1.5b-instruct | 11.6 | 5.82 | 3 | 0.29 | 0.12 | 0.27 |
| qwen2-0.5b-instruct | 9.51 | 5.55 | 4 | 0.27 | 0.091 | 0.25 |
| qwen2.5-coder-0.5b-instruct | 9.26 | 5.62 | 4 | 0.26 | 0.077 | 0.25 |
| qwen1.5-1.8b-chat | 8.84 | 4.47 | 3 | 0.26 | 0.11 | 0.23 |
| qwen1.5-0.5b-chat | 5.49 | 3.24 | 4 | 0.21 | 0.042 | 0.2 |
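
The three SE columns appear to combine in quadrature: for every row with count ≥ 2, SE(A)² ≈ SE_x(A)² + SE_pred(A)², as expected if SE_x(A) and SE_pred(A) are independent variance components, and rows with count = 1 show NaN, presumably because a single run cannot separate the two components. A quick check of this reading against a few rows (a sketch against values copied from the table, not this site's own code):

```python
import math

# (model, SE(A), SE_x(A), SE_pred(A)) -- values copied from the table above
rows = [
    ("qwen3-32b",             0.42, 0.34, 0.24),
    ("google_gemma_3_12b_it", 0.45, 0.38, 0.24),
    ("llama-3.2-1B-instruct", 0.34, 0.34, 0.0),
]
for name, se, se_x, se_pred in rows:
    combined = math.hypot(se_x, se_pred)  # sqrt(SE_x^2 + SE_pred^2)
    print(f"{name}: reported SE(A) {se:.2f}, combined {combined:.2f}")
```

Agreement is within the rounding of the displayed values (some rows, e.g. mistralai_mixtral_8x22b_instruct_v0.1, differ by 0.01 for that reason). The Llama rows, with SE_pred(A) = 0 and SE_x(A) equal to SE(A), fit the same reading: all of their measured variance lies across questions.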