jeebench_chat_cot: results by model



SE predicted by accuracy

Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.
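The page does not state how the predicted standard error is computed. One common baseline (an assumption here, not taken from this page) is the binomial standard error of a pass rate, sqrt(p(1-p)/n), which falls as accuracy approaches 0 or 1 — consistent with SE shrinking for the lower-accuracy models in the table below. A minimal sketch, with a hypothetical question count:

```python
import math

def binomial_se(accuracy: float, n_questions: int) -> float:
    """Standard error of a pass rate under a simple binomial model.

    This is an assumed baseline, not necessarily the SE_pred(A)
    definition used on this page.
    """
    p = accuracy
    return math.sqrt(p * (1.0 - p) / n_questions)

# e.g. 26.8% accuracy on a hypothetical 500-question set
se = binomial_se(0.268, 500)
print(round(100 * se, 2))  # SE in percentage points → 1.98
```

Note that a binomial model ignores clustering across related questions, so the empirical SEs reported in the table can differ from this baseline.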

CDF of question-level accuracy
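A question-level accuracy CDF of this kind can be built by sorting the per-question accuracies and stepping the cumulative fraction. A minimal sketch with hypothetical data (the actual per-question values are not reproduced on this page):

```python
import numpy as np

# Hypothetical per-question accuracies: the fraction of model runs
# that answered each question correctly.
question_acc = np.array([0.0, 0.1, 0.1, 0.3, 0.5, 0.9])

# Empirical CDF: for each observed accuracy x, the fraction of
# questions with accuracy <= x.
xs = np.sort(question_acc)
cdf = np.arange(1, len(xs) + 1) / len(xs)

for x, c in zip(xs, cdf):
    print(f"P(acc <= {x:.1f}) = {c:.2f}")
```

A heavy mass near zero in such a CDF would indicate that most questions are answered correctly by few or no models.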

Results table by model

| Model | pass@1 | Win rate | Count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| qwen3-32b | 26.8 | 22.8 | 3 | 2 | 1.2 | 1.5 |
| google_gemma_3_12b_it | 26.1 | 21.9 | 4 | 1.9 | 1.4 | 1.4 |
| qwen3-14b | 25.8 | 21.7 | 3 | 1.9 | 1.3 | 1.4 |
| qwen3-4b | 22.3 | 18.5 | 4 | 1.8 | 1.3 | 1.3 |
| qwen2-72b-instruct | 22.1 | 18.5 | 2 | 1.8 | 1.1 | 1.4 |
| qwen2.5-coder-32b-instruct | 21.2 | 17.5 | 2 | 1.8 | 1.1 | 1.4 |
| qwen3-8b | 18.8 | 15.6 | 3 | 1.7 | 1.1 | 1.3 |
| qwen2.5-coder-14b-instruct | 15.1 | 12.3 | 3 | 1.6 | 0.88 | 1.3 |
| qwen1.5-32b-chat | 14.6 | 12.1 | 3 | 1.6 | 0.91 | 1.3 |
| google_gemma_2_27b_it | 13.2 | 11 | 2 | 1.5 | 1 | 1.1 |
| qwen1.5-72b-chat | 13 | 10.7 | 2 | 1.5 | 0.83 | 1.2 |
| google_gemma_7b_it | 11.9 | 10.1 | 4 | 1.4 | 1 | 1 |
| google_gemma_3_4b_it | 11.5 | 9.42 | 5 | 1.4 | 0.87 | 1.1 |
| google_gemma_2_9b_it | 9.13 | 7.57 | 4 | 1.3 | 0.78 | 1 |
| qwen2.5-coder-7b-instruct | 8.74 | 7.02 | 4 | 1.2 | 0.57 | 1.1 |
| qwen1.5-14b-chat | 8.61 | 7.1 | 3 | 1.2 | 0.65 | 1.1 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 8.48 | 6.71 | 3 | 1.2 | 0.64 | 1 |
| qwen2-math-72b-instruct | 7.96 | 6.29 | 2 | 1.2 | 0.84 | 0.85 |
| qwen3-1.7b | 6.99 | 5.5 | 4 | 1.1 | 0.72 | 0.86 |
| llama-3.1-8B-instruct | 6.6 | 5.52 | 7 | 1.1 | 1.1 | 0 |
| google_codegemma_1.1_7b_it | 6.33 | 5.16 | 5 | 1.1 | 0.4 | 0.99 |
| mistralai_mistral_7b_instruct_v0.2 | 6.17 | 5.21 | 4 | 1.1 | 0.46 | 0.96 |
| mistralai_mistral_7b_instruct_v0.3 | 6.12 | 4.97 | 4 | 1.1 | 0.51 | 0.92 |
| qwen2-7b-instruct | 6.02 | 4.82 | 4 | 1 | 0.37 | 0.98 |
| qwen2-1.5b-instruct | 5.63 | 4.91 | 4 | 1 | 0.38 | 0.94 |
| qwen2-math-7b-instruct | 5.63 | 4.32 | 2 | 1 | 0.63 | 0.8 |
| qwen1.5-7b-chat | 5.44 | 4.55 | 3 | 1 | 0.31 | 0.95 |
| deepseek_v2_lite_chat | 5.37 | 4.53 | 3 | 0.99 | 0.33 | 0.94 |
| deepseek_r1_distill_qwen_14b | 5.1 | 3.87 | 4 | 0.97 | 0.57 | 0.79 |
| qwen2.5-coder-3b-instruct | 4.76 | 3.84 | 4 | 0.94 | 0.39 | 0.85 |
| llama-3.2-3B-instruct | 4.66 | 3.86 | 10 | 0.93 | 0.93 | 0 |
| llama-3.2-1B-instruct | 4.66 | 3.9 | 13 | 0.93 | 0.93 | 0 |
| deepseek_r1_distill_qwen_32b | 4.66 | 3.54 | 2 | 0.93 | 0.47 | 0.8 |
| google_gemma_3_1b_it | 4.56 | 3.57 | 4 | 0.92 | 0.52 | 0.76 |
| google_gemma_2b_it | 4.47 | 3.71 | 4 | 0.91 | 0.62 | 0.66 |
| deepseek_r1_distill_llama_70b | 4.37 | 3.31 | 2 | 0.9 | 0.61 | 0.66 |
| mistralai_mathstral_7b_v0.1 | 4.27 | 3.42 | 4 | 0.89 | 0.34 | 0.82 |
| mistralai_ministral_8b_instruct_2410 | 3.83 | 3.05 | 4 | 0.85 | 0.27 | 0.8 |
| qwen2-math-1.5b-instruct | 3.56 | 2.63 | 3 | 0.82 | 0.5 | 0.64 |
| qwen1.5-1.8b-chat | 3.43 | 2.83 | 3 | 0.8 | 0.3 | 0.74 |
| qwen3-0.6b | 3.18 | 2.45 | 5 | 0.77 | 0.35 | 0.69 |
| qwen2-0.5b-instruct | 3.15 | 2.7 | 5 | 0.77 | 0.24 | 0.73 |
| mistralai_mistral_7b_instruct_v0.1 | 3.11 | 2.58 | 4 | 0.76 | 0.26 | 0.72 |
| deepseek_r1_distill_qwen_7b | 2.77 | 2.01 | 4 | 0.72 | 0.48 | 0.54 |
| deepseek_r1_distill_llama_8b | 1.89 | 1.43 | 4 | 0.6 | 0.27 | 0.53 |
| qwen2.5-coder-0.5b-instruct | 1.86 | 1.47 | 5 | 0.6 | 0.21 | 0.56 |
| qwen1.5-0.5b-chat | 1.86 | 1.56 | 5 | 0.6 | 0.14 | 0.58 |
| qwen2.5-coder-1.5b-instruct | 1.8 | 1.39 | 4 | 0.59 | 0.19 | 0.55 |
| deepseek_r1_distill_qwen_1.5b | 1.21 | 0.865 | 4 | 0.48 | 0.19 | 0.44 |