jeebench_chat_cot: by examples


Not solved by any model

There are 72 examples not solved by any model. Provided these are sound problems, solving some of them is a good signal that your model is genuinely better than the leading models.
368, 370, 376, 379, 381, 387, 388, 389, 391, 393, 394, 404, 406, 407, 411, 412, 413, 414, 417, 419, 420, 421, 422, 423, 424, 427, 428, 429, 431, 435, 442, 451, 452, 453, 454, 455, 456, 457, 459, 461, 462, 463, 466, 467, 469, 470, 471, 473, 474, 477, 478, 479, 481, 482, 484, 485, 487, 488, 489, 490, 491, 495, 497, 500, 501, 502, 507, 508, 509, 510, 512, 513
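
A list like the one above could be derived from per-example scores roughly as follows. This is a minimal sketch, not the page's actual pipeline: the `results` layout (model name mapped to per-example pass@1) and the function name `unsolved_by_all` are assumptions made for illustration, and "not solved" is taken to mean a score of exactly zero.

```python
# Hypothetical layout: model name -> {example id -> pass@1 on that example}.
results = {
    "model_a": {1: 0.0, 2: 0.5, 3: 0.0},
    "model_b": {1: 0.0, 2: 0.0, 3: 0.25},
}


def unsolved_by_all(results):
    """Return sorted example ids on which every model scored zero."""
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)
    # Assumption: a missing score is treated the same as a score of zero.
    return sorted(
        ex
        for ex in example_ids
        if all(per_example.get(ex, 0.0) == 0.0 for per_example in results.values())
    )


print(unsolved_by_all(results))
```

In this toy data only example 1 is unsolved, since examples 2 and 3 each have one model with a nonzero score.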

Problems solved by 1 model only

example_link	model	min_pass1_of_model
499	google_gemma_3_12b_it	0.276
244	qwen2-72b-instruct	0.254
443	qwen3-32b	0.249
223	qwen3-32b	0.249
498	qwen2.5-coder-32b-instruct	0.237
378	qwen1.5-72b-chat	0.134
383	qwen2-math-72b-instruct	0.114
458	llama-3.1-8B-instruct	0.105
397	qwen3-1.7b	0.066
415	qwen3-1.7b	0.066
386	mistralai_mathstral_7b_v0.1	0.051
390	google_gemma_3_1b_it	0.042
464	deepseek_r1_distill_llama_8b	0.035
329	qwen2.5-coder-0.5b-instruct	0.017
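
The table above could be rebuilt along these lines. This is a sketch under stated assumptions: `min_pass1_of_model` is read here as the overall pass@1 of the single model that solved the example, and the `results` layout and the name `solved_by_one` are hypothetical.

```python
# Hypothetical layout: model name -> {example id -> pass@1 on that example}.
results = {
    "strong": {1: 1.0, 2: 1.0, 3: 1.0, 4: 0.0},
    "weak": {1: 0.0, 2: 1.0, 3: 0.0, 4: 1.0},
}


def solved_by_one(results):
    """Return (example, model, overall pass@1 of model) rows for examples
    that exactly one model solved, sorted by the model's overall score."""
    # Overall pass@1 of a model: mean of its per-example scores.
    overall = {m: sum(s.values()) / len(s) for m, s in results.items()}
    solvers = {}
    for model, per_example in results.items():
        for ex, score in per_example.items():
            if score > 0:
                solvers.setdefault(ex, []).append(model)
    rows = [
        (ex, models[0], overall[models[0]])
        for ex, models in solvers.items()
        if len(models) == 1
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


for row in solved_by_one(results):
    print(row)
```

Rows where the only solver has a low overall score are worth a second look, because a weak model being the sole solver may point to a lucky guess or a faulty answer key rather than real ability.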

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).

example_link	pass1_of_ex	tau
329	0.002	-0.202
401	0.019	-0.192
464	0.002	-0.168
382	0.077	-0.156
436	0.045	-0.149
384	0.012	-0.141
286	0.021	-0.138
390	0.002	-0.135
503	0.022	-0.135
30	0.061	-0.134
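
To illustrate how a negative correlation of this kind arises, the sketch below computes a rank correlation between each model's result on one problem and its overall score. It assumes the `tau` column is Kendall's tau (the tau-b variant, which handles ties); the page does not state which statistic is used, and the function name and toy data are hypothetical.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pairs)."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both; contributes to neither factor
            elif dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Toy data: four models ordered from weakest to strongest overall.
overall_score = [0.1, 0.2, 0.3, 0.4]
# Only the two weakest models solve this problem.
solved_problem = [1, 1, 0, 0]

print(kendall_tau_b(solved_problem, overall_score))
```

A negative value means the models that solve the problem tend to be the weaker ones overall, which is the pattern flagged in the table above.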

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
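
The binning behind such a histogram can be sketched as follows, assuming equal-width bins over the range 0 to 1. The bin count and the function name `accuracy_histogram` are illustrative choices, not taken from the page.

```python
def accuracy_histogram(accuracies, n_bins=10):
    """Count problems per equal-width accuracy bin over [0, 1]."""
    counts = [0] * n_bins
    for acc in accuracies:
        # An accuracy of exactly 1.0 is placed in the last bin.
        idx = min(int(acc * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical per-problem accuracies (fraction of models solving each problem).
print(accuracy_histogram([0.0, 0.05, 0.5, 1.0]))
```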

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.