Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy; a sketch of the underlying variance decomposition follows the table.
| Model | pass@1 (%) | Win rate (%) | Count | SE(A) (%) | SE_x(A) (%) | SE_pred(A) (%) |
|---|---|---|---|---|---|---|
| qwen3-32b | 26.8 | 22.8 | 3 | 2 | 1.2 | 1.5 |
| google_gemma_3_12b_it | 26.1 | 21.9 | 4 | 1.9 | 1.4 | 1.4 |
| qwen3-14b | 25.8 | 21.7 | 3 | 1.9 | 1.3 | 1.4 |
| qwen3-4b | 22.3 | 18.5 | 4 | 1.8 | 1.3 | 1.3 |
| qwen2-72b-instruct | 22.1 | 18.5 | 2 | 1.8 | 1.1 | 1.4 |
| qwen2.5-coder-32b-instruct | 21.2 | 17.5 | 2 | 1.8 | 1.1 | 1.4 |
| qwen3-8b | 18.8 | 15.6 | 3 | 1.7 | 1.1 | 1.3 |
| qwen2.5-coder-14b-instruct | 15.1 | 12.3 | 3 | 1.6 | 0.88 | 1.3 |
| qwen1.5-32b-chat | 14.6 | 12.1 | 3 | 1.6 | 0.91 | 1.3 |
| google_gemma_2_27b_it | 13.2 | 11 | 2 | 1.5 | 1 | 1.1 |
| qwen1.5-72b-chat | 13 | 10.7 | 2 | 1.5 | 0.83 | 1.2 |
| google_gemma_7b_it | 11.9 | 10.1 | 4 | 1.4 | 1 | 1 |
| google_gemma_3_4b_it | 11.5 | 9.42 | 5 | 1.4 | 0.87 | 1.1 |
| google_gemma_2_9b_it | 9.13 | 7.57 | 4 | 1.3 | 0.78 | 1 |
| qwen2.5-coder-7b-instruct | 8.74 | 7.02 | 4 | 1.2 | 0.57 | 1.1 |
| qwen1.5-14b-chat | 8.61 | 7.1 | 3 | 1.2 | 0.65 | 1.1 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 8.48 | 6.71 | 3 | 1.2 | 0.64 | 1 |
| qwen2-math-72b-instruct | 7.96 | 6.29 | 2 | 1.2 | 0.84 | 0.85 |
| qwen3-1.7b | 6.99 | 5.5 | 4 | 1.1 | 0.72 | 0.86 |
| llama-3.1-8B-instruct | 6.6 | 5.52 | 7 | 1.1 | 1.1 | 0 |
| google_codegemma_1.1_7b_it | 6.33 | 5.16 | 5 | 1.1 | 0.4 | 0.99 |
| mistralai_mistral_7b_instruct_v0.2 | 6.17 | 5.21 | 4 | 1.1 | 0.46 | 0.96 |
| mistralai_mistral_7b_instruct_v0.3 | 6.12 | 4.97 | 4 | 1.1 | 0.51 | 0.92 |
| qwen2-7b-instruct | 6.02 | 4.82 | 4 | 1 | 0.37 | 0.98 |
| qwen2-1.5b-instruct | 5.63 | 4.91 | 4 | 1 | 0.38 | 0.94 |
| qwen2-math-7b-instruct | 5.63 | 4.32 | 2 | 1 | 0.63 | 0.8 |
| qwen1.5-7b-chat | 5.44 | 4.55 | 3 | 1 | 0.31 | 0.95 |
| deepseek_v2_lite_chat | 5.37 | 4.53 | 3 | 0.99 | 0.33 | 0.94 |
| deepseek_r1_distill_qwen_14b | 5.1 | 3.87 | 4 | 0.97 | 0.57 | 0.79 |
| qwen2.5-coder-3b-instruct | 4.76 | 3.84 | 4 | 0.94 | 0.39 | 0.85 |
| llama-3.2-3B-instruct | 4.66 | 3.86 | 10 | 0.93 | 0.93 | 0 |
| llama-3.2-1B-instruct | 4.66 | 3.9 | 13 | 0.93 | 0.93 | 0 |
| deepseek_r1_distill_qwen_32b | 4.66 | 3.54 | 2 | 0.93 | 0.47 | 0.8 |
| google_gemma_3_1b_it | 4.56 | 3.57 | 4 | 0.92 | 0.52 | 0.76 |
| google_gemma_2b_it | 4.47 | 3.71 | 4 | 0.91 | 0.62 | 0.66 |
| deepseek_r1_distill_llama_70b | 4.37 | 3.31 | 2 | 0.9 | 0.61 | 0.66 |
| mistralai_mathstral_7b_v0.1 | 4.27 | 3.42 | 4 | 0.89 | 0.34 | 0.82 |
| mistralai_ministral_8b_instruct_2410 | 3.83 | 3.05 | 4 | 0.85 | 0.27 | 0.8 |
| qwen2-math-1.5b-instruct | 3.56 | 2.63 | 3 | 0.82 | 0.5 | 0.64 |
| qwen1.5-1.8b-chat | 3.43 | 2.83 | 3 | 0.8 | 0.3 | 0.74 |
| qwen3-0.6b | 3.18 | 2.45 | 5 | 0.77 | 0.35 | 0.69 |
| qwen2-0.5b-instruct | 3.15 | 2.7 | 5 | 0.77 | 0.24 | 0.73 |
| mistralai_mistral_7b_instruct_v0.1 | 3.11 | 2.58 | 4 | 0.76 | 0.26 | 0.72 |
| deepseek_r1_distill_qwen_7b | 2.77 | 2.01 | 4 | 0.72 | 0.48 | 0.54 |
| deepseek_r1_distill_llama_8b | 1.89 | 1.43 | 4 | 0.6 | 0.27 | 0.53 |
| qwen2.5-coder-0.5b-instruct | 1.86 | 1.47 | 5 | 0.6 | 0.21 | 0.56 |
| qwen1.5-0.5b-chat | 1.86 | 1.56 | 5 | 0.6 | 0.14 | 0.58 |
| qwen2.5-coder-1.5b-instruct | 1.8 | 1.39 | 4 | 0.59 | 0.19 | 0.55 |
| deepseek_r1_distill_qwen_1.5b | 1.21 | 0.865 | 4 | 0.48 | 0.19 | 0.44 |
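Throughout the table, SE(A)² ≈ SE_x(A)² + SE_pred(A)², which is consistent with splitting the total standard error into a question-sampling component (SE_x) and a prediction-resampling component (SE_pred), as in a clustered standard error with several resamples per question. The sketch below shows one common way to compute such a decomposition; treating the Count column as the number of resamples per question, and the toy score matrix, question count, and function name used here, are assumptions for illustration rather than a description of how this table was actually produced.

```python
import numpy as np


def se_decomposition(scores: np.ndarray):
    """Decompose the standard error of a mean eval score.

    scores: hypothetical array of shape (n_questions, k_samples) with
    per-sample scores in [0, 1]; k_samples plays the role of the Count
    column above. Returns (accuracy, se_total, se_x, se_pred) on a 0-1
    scale (multiply by 100 to compare with the percentage-point table).
    """
    n, k = scores.shape
    per_question_mean = scores.mean(axis=1)  # estimate of E[s | x] per question

    accuracy = per_question_mean.mean()

    # Clustered (total) variance of the mean: variance of the
    # per-question means across the n benchmark questions, divided by n.
    var_total = per_question_mean.var(ddof=1) / n

    # Prediction-resampling component: within-question variance across
    # the k resamples, averaged over questions, shrunk by n * k.
    var_pred = scores.var(axis=1, ddof=1).mean() / (n * k)

    # Question-sampling component: the remainder, so that
    # SE(A)^2 = SE_x(A)^2 + SE_pred(A)^2 holds by construction.
    var_x = max(var_total - var_pred, 0.0)

    return accuracy, np.sqrt(var_total), np.sqrt(var_x), np.sqrt(var_pred)


# Toy usage with a hypothetical 500-question benchmark and k = 3 resamples.
rng = np.random.default_rng(0)
toy_scores = rng.binomial(1, 0.25, size=(500, 3)).astype(float)
acc, se, se_x, se_pred = se_decomposition(toy_scores)
print(f"acc={acc:.3f}  SE={se:.3f}  SE_x={se_x:.3f}  SE_pred={se_pred:.3f}")
```

Under this decomposition, SE_pred(A) collapses to zero when the resamples of each question are identical (for example, deterministic decoding), which is one way a row in the table above can report SE_pred(A) = 0.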