jeebench_chat_cot: by examples


Not solved by any model

There are 97 examples that no model solved. If these are sound problems, solving some of them is a good signal that your model is indeed better than the leading models.
125, 223, 244, 246, 267, 281, 283, 284, 288, 292, 370, 376, 378, 379, 381, 386, 387, 388, 389, 393, 394, 397, 404, 406, 407, 408, 409, 411, 412, 413, 414, 415, 417, 419, 420, 421, 422, 423, 424, 427, 428, 429, 431, 435, 438, 442, 443, 445, 451, 452, 453, 454, 455, 456, 457, 458, 459, 461, 462, 463, 464, 466, 467, 469, 470, 471, 472, 473, 474, 476, 477, 478, 479, 481, 482, 484, 485, 487, 488, 489, 491, 492, 495, 497, 498, 499, 500, 501, 502, 503, 506, 507, 508, 509, 510, 512, 513
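A list like this can be derived from per-model outcomes. The following is a minimal sketch, assuming the results are available as a mapping from model name to per-example solved flags; that layout, the function name `unsolved_by_all`, and the toy model names and outcomes are all hypothetical and not taken from this page.

```python
def unsolved_by_all(results):
    """Return sorted example ids that no model solved.

    `results` maps model name -> {example id -> True/False}. This layout
    is an assumption made for illustration only.
    """
    all_examples = set()
    for per_example in results.values():
        all_examples.update(per_example)
    return sorted(
        ex
        for ex in all_examples
        if not any(per_example.get(ex, False) for per_example in results.values())
    )


# Toy data: model names and outcomes are made up.
toy_results = {
    "model_a": {125: False, 148: True, 223: False},
    "model_b": {125: False, 148: False, 223: False},
}
unsolved = unsolved_by_all(toy_results)
print(unsolved)
```

An example missing from a model's mapping is treated as unsolved here; a stricter version might raise an error instead.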

Problems solved by only one model

example_link	model	min_pass1_of_model
148	qwen3-32b	0.268
447	qwen3-32b	0.268
441	qwen3-32b	0.268
391	qwen3-32b	0.268
448	qwen3-32b	0.268
209	qwen3-32b	0.268
133	qwen3-14b	0.258
174	qwen3-14b	0.258
112	qwen3-4b	0.223
150	qwen3-4b	0.223
237	qwen2-72b-instruct	0.221
273	qwen2.5-coder-32b-instruct	0.212
213	qwen2.5-coder-32b-instruct	0.212
211	qwen2.5-coder-32b-instruct	0.212
202	qwen3-8b	0.188
197	google_gemma_2_27b_it	0.132
132	google_gemma_3_4b_it	0.115
199	google_gemma_3_4b_it	0.115
196	google_gemma_2_9b_it	0.091
511	google_gemma_2_9b_it	0.091
433	qwen1.5-14b-chat	0.086
460	qwen3-1.7b	0.070
486	llama-3.1-8B-instruct	0.066
277	google_codegemma_1.1_7b_it	0.063
480	mistralai_mistral_7b_instruct_v0.2	0.062
374	mistralai_mistral_7b_instruct_v0.3	0.061
384	qwen2-7b-instruct	0.060
252	qwen2-7b-instruct	0.060
188	qwen2-1.5b-instruct	0.056
259	qwen2-1.5b-instruct	0.056
383	qwen2-math-7b-instruct	0.056
396	deepseek_r1_distill_qwen_14b	0.051
216	qwen2.5-coder-3b-instruct	0.048
468	google_gemma_2b_it	0.045
368	qwen3-0.6b	0.032
390	qwen2-0.5b-instruct	0.031
490	deepseek_r1_distill_llama_8b	0.019
364	qwen2.5-coder-1.5b-instruct	0.018
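The table above could be produced along the following lines. This is a sketch under two assumptions that the page does not state: that `min_pass1_of_model` is the lowest overall pass@1 among the models that solved the example (with a single solver, simply that model's score), and that rows are sorted by that score in descending order. The function name, data layout, and toy values are hypothetical.

```python
def solved_by_one_model(results, model_pass1):
    """List (example id, model, model's overall pass@1) for examples
    solved by exactly one model.

    `results` maps model name -> {example id -> True/False};
    `model_pass1` maps model name -> overall pass@1. Both are assumed
    layouts, not the page's actual data format.
    """
    examples = {ex for per_example in results.values() for ex in per_example}
    rows = []
    for ex in examples:
        solvers = [
            model
            for model, per_example in results.items()
            if per_example.get(ex, False)
        ]
        if len(solvers) == 1:
            model = solvers[0]
            rows.append((ex, model, model_pass1[model]))
    # Strongest sole solver first, ties broken by example id.
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows


# Toy data: model names, outcomes, and scores are made up.
toy_results = {
    "model_a": {148: True, 133: False, 125: False, 30: True},
    "model_b": {148: False, 133: True, 125: False, 30: True},
}
toy_pass1 = {"model_a": 0.30, "model_b": 0.20}
single_solver_rows = solved_by_one_model(toy_results, toy_pass1)
for row in single_solver_rows:
    print(*row, sep="\t")
```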

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation. A negative tau means that better models tend to do worse on the problem.

example_link	pass1_of_ex	tau
364	0.005	-0.194
490	0.005	-0.169
390	0.004	-0.143
368	0.004	-0.135
30	0.077	-0.132
494	0.104	-0.118
48	0.070	-0.104
28	0.175	-0.103
192	0.021	-0.101
119	0.012	-0.096
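The page does not say how `tau` is computed. One plausible reading is Kendall's tau between each model's overall score and its outcome on the problem; the sketch below implements the tau-b variant in pure Python under that assumption. The function name and toy numbers are hypothetical.

```python
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences.

    Uses the O(n^2) pairwise definition, which is fine for a few dozen
    models. Returns 0.0 when the statistic is undefined (all ties).
    """
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: ignored by tau-b
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Toy data: overall scores of four made-up models, and whether each
# solved one problem (1) or not (0). Here only the weaker models solve it.
model_scores = [0.1, 0.2, 0.3, 0.4]
solved_flags = [1, 1, 0, 0]
tau = kendall_tau_b(model_scores, solved_flags)
print(round(tau, 3))
```

A problem that only weaker models solve gets a negative value, which matches the sign of every row in the table above.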

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
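As an illustration of the binning behind such a histogram, the sketch below counts per-problem accuracies in equal-width bins over [0, 1]. The bin count of 20 is an assumption, and the input is only the ten `pass1_of_ex` values from the suspect-problems table, not the full benchmark.

```python
def bin_counts(values, n_bins=20):
    """Count values in n_bins equal-width bins covering [0, 1].

    A value of exactly 1.0 is placed in the last bin.
    """
    counts = [0] * n_bins
    for value in values:
        index = min(int(value * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# pass1_of_ex values copied from the suspect-problems table.
pass1_values = [0.005, 0.005, 0.004, 0.004, 0.077,
                0.104, 0.070, 0.175, 0.021, 0.012]
counts = bin_counts(pass1_values)

# Text rendering: one row per non-empty bin.
for i, count in enumerate(counts):
    if count:
        print(f"[{i / 20:.2f}, {(i + 1) / 20:.2f})  {'#' * count}")
```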

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.