jeebench_chat_cot: by models



Figure: SE predicted by accuracy. Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.
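As context for the plot, one simple way to predict such curves is the binomial standard error of a single model's accuracy, with the error of a difference between two models adding in quadrature. The sketch below is an illustration under assumptions of my own, not this page's stated method: `n_questions` is a hypothetical dataset size, and treating two models' per-question errors as independent is often optimistic in practice.

```python
import math

def binomial_se(accuracy_pct: float, n_questions: int) -> float:
    """Standard error of mean accuracy under a binomial model, in points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

def pairwise_se(acc_a: float, acc_b: float, n_questions: int) -> float:
    """SE of the accuracy gap between two models, assuming their
    per-question errors are independent (an assumption, not a given)."""
    return math.hypot(binomial_se(acc_a, n_questions),
                      binomial_se(acc_b, n_questions))

# Hypothetical question count; substitute the dataset's actual size.
N = 500
print(binomial_se(27.6, N))        # ~2.0 points at 27.6% accuracy
print(pairwise_se(27.6, 25.4, N))  # SE of the gap between two such models
```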

Figure: CDF of question-level accuracy.

Results table by model

| model | pass@1 | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| google_gemma_3_12b_it | 27.6 | 22.9 | 11 | 2 | 1.4 | 1.4 |
| qwen2-72b-instruct | 25.4 | 21 | 10 | 1.9 | 1.3 | 1.4 |
| qwen3-32b | 24.9 | 20.6 | 10 | 1.9 | 1.3 | 1.4 |
| qwen3-14b | 24.6 | 20.1 | 12 | 1.9 | 1.4 | 1.3 |
| qwen2.5-coder-32b-instruct | 23.7 | 19.2 | 10 | 1.9 | 1.3 | 1.3 |
| qwen3-4b | 21.7 | 17.5 | 12 | 1.8 | 1.3 | 1.3 |
| qwen3-8b | 17.4 | 14 | 12 | 1.7 | 1.2 | 1.2 |
| qwen2.5-coder-14b-instruct | 16.8 | 13.3 | 12 | 1.6 | 1 | 1.3 |
| qwen1.5-32b-chat | 13.5 | 10.9 | 11 | 1.5 | 0.93 | 1.2 |
| qwen1.5-72b-chat | 13.4 | 10.7 | 10 | 1.5 | 0.92 | 1.2 |
| google_gemma_7b_it | 13 | 10.9 | 13 | 1.5 | 1.1 | 1 |
| google_gemma_2_27b_it | 12.7 | 10.4 | 10 | 1.5 | 0.99 | 1.1 |
| qwen2-math-72b-instruct | 11.4 | 9.05 | 10 | 1.4 | 1.1 | 0.85 |
| qwen2.5-coder-7b-instruct | 11.2 | 8.76 | 10 | 1.4 | 0.75 | 1.2 |
| google_gemma_3_4b_it | 11.1 | 8.85 | 13 | 1.4 | 0.86 | 1.1 |
| llama-3.1-8B-instruct | 10.5 | 8.59 | 15 | 1.4 | 1.4 | 0 |
| google_gemma_2_9b_it | 9.29 | 7.51 | 11 | 1.3 | 0.87 | 0.94 |
| qwen1.5-14b-chat | 9.17 | 7.42 | 12 | 1.3 | 0.69 | 1.1 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 8.93 | 7.07 | 11 | 1.3 | 0.65 | 1.1 |
| qwen2-1.5b-instruct | 7.57 | 6.5 | 13 | 1.2 | 0.59 | 1 |
| google_codegemma_1.1_7b_it | 7.45 | 6 | 13 | 1.2 | 0.5 | 1 |
| qwen2-7b-instruct | 7.13 | 5.52 | 11 | 1.1 | 0.59 | 0.97 |
| deepseek_r1_distill_qwen_32b | 6.64 | 4.9 | 10 | 1.1 | 0.73 | 0.82 |
| mistralai_mistral_7b_instruct_v0.3 | 6.62 | 5.33 | 11 | 1.1 | 0.55 | 0.95 |
| qwen3-1.7b | 6.62 | 5.05 | 12 | 1.1 | 0.74 | 0.81 |
| qwen1.5-7b-chat | 6.57 | 5.29 | 12 | 1.1 | 0.5 | 0.97 |
| llama-3.2-3B-instruct | 6.41 | 5.07 | 17 | 1.1 | 1.1 | 0 |
| deepseek_v2_lite_chat | 6.41 | 5.21 | 11 | 1.1 | 0.47 | 0.97 |
| qwen2-math-7b-instruct | 6.15 | 4.61 | 6 | 1.1 | 0.76 | 0.74 |
| deepseek_r1_distill_qwen_14b | 6.02 | 4.48 | 11 | 1 | 0.68 | 0.79 |
| mistralai_ministral_8b_instruct_2410 | 5.63 | 4.41 | 11 | 1 | 0.45 | 0.91 |
| mistralai_mistral_7b_instruct_v0.1 | 5.53 | 4.49 | 11 | 1 | 0.45 | 0.9 |
| deepseek_r1_distill_llama_70b | 5.48 | 3.98 | 10 | 1 | 0.66 | 0.75 |
| mistralai_mistral_7b_instruct_v0.2 | 5.32 | 4.33 | 10 | 0.99 | 0.5 | 0.85 |
| qwen2.5-coder-3b-instruct | 5.16 | 3.92 | 12 | 0.97 | 0.42 | 0.88 |
| google_gemma_2b_it | 5.11 | 4.1 | 13 | 0.97 | 0.74 | 0.62 |
| mistralai_mathstral_7b_v0.1 | 5.07 | 3.98 | 11 | 0.97 | 0.38 | 0.89 |
| qwen2-0.5b-instruct | 4.9 | 4.03 | 13 | 0.95 | 0.41 | 0.86 |
| qwen1.5-1.8b-chat | 4.31 | 3.5 | 11 | 0.89 | 0.35 | 0.82 |
| llama-3.2-1B-instruct | 4.27 | 3.47 | 12 | 0.89 | 0.89 | 0 |
| google_gemma_3_1b_it | 4.22 | 3.15 | 12 | 0.89 | 0.51 | 0.72 |
| deepseek_r1_distill_qwen_7b | 4.17 | 2.95 | 11 | 0.88 | 0.59 | 0.66 |
| qwen2-math-1.5b-instruct | 4.03 | 2.95 | 4 | 0.87 | 0.54 | 0.67 |
| qwen1.5-0.5b-chat | 3.91 | 3.26 | 13 | 0.85 | 0.37 | 0.77 |
| deepseek_r1_distill_llama_8b | 3.54 | 2.57 | 12 | 0.81 | 0.43 | 0.69 |
| qwen3-0.6b | 2.66 | 1.94 | 13 | 0.71 | 0.37 | 0.61 |
| deepseek_r1_distill_qwen_1.5b | 2.25 | 1.57 | 12 | 0.65 | 0.35 | 0.55 |
| qwen2.5-coder-1.5b-instruct | 1.8 | 1.33 | 11 | 0.59 | 0.23 | 0.54 |
| qwen2.5-coder-0.5b-instruct | 1.69 | 1.33 | 13 | 0.57 | 0.17 | 0.54 |
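The three SE columns appear to satisfy a quadrature relation, SE(A)^2 ~ SE_x(A)^2 + SE_pred(A)^2 (for example, llama-3.1-8B-instruct has SE_x = 1.4 and SE_pred = 0, giving SE(A) = 1.4), which is consistent with a split of the total variance into a between-question and a within-question component. This is my reading of the numbers, not a definition stated on this page; a quick check against a few rows:

```python
import math

# (SE_A, SE_x, SE_pred) triples copied from rows of the table above.
rows = {
    "google_gemma_3_12b_it": (2.0, 1.4, 1.4),
    "qwen2-math-72b-instruct": (1.4, 1.1, 0.85),
    "llama-3.1-8B-instruct": (1.4, 1.4, 0.0),
    "qwen2.5-coder-0.5b-instruct": (0.57, 0.17, 0.54),
}

for model, (se_a, se_x, se_pred) in rows.items():
    # If the decomposition holds, the two components add in quadrature.
    combined = math.hypot(se_x, se_pred)
    print(f"{model}: SE(A)={se_a}, sqrt(SE_x^2 + SE_pred^2)={combined:.2f}")
```

On these rows the recombined value matches the reported SE(A) to within rounding of the displayed two significant figures.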