jeebench_chat_cot: by examples



Not solved by any model

There are 97 examples that no model solved. If these problems are well-posed, solving some of them is a good signal that your model is better than the leading models.
125, 223, 244, 246, 267, 281, 283, 284, 288, 292, 370, 376, 378, 379, 381, 386, 387, 388, 389, 393, 394, 397, 404, 406, 407, 408, 409, 411, 412, 413, 414, 415, 417, 419, 420, 421, 422, 423, 424, 427, 428, 429, 431, 435, 438, 442, 443, 445, 451, 452, 453, 454, 455, 456, 457, 458, 459, 461, 462, 463, 464, 466, 467, 469, 470, 471, 472, 473, 474, 476, 477, 478, 479, 481, 482, 484, 485, 487, 488, 489, 491, 492, 495, 497, 498, 499, 500, 501, 502, 503, 506, 507, 508, 509, 510, 512, 513
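A list like the one above could be derived from per-model outcomes. The following is a minimal sketch, assuming a hypothetical layout in which `results[model][example_id]` is `True` when that model solved the example; the page does not show its actual storage format, and the function name is illustrative only.

```python
from typing import Dict, List

# Hypothetical layout (an assumption, not the page's real format):
# results[model][example_id] is True when the model solved the example.
Results = Dict[str, Dict[int, bool]]


def unsolved_examples(results: Results) -> List[int]:
    """Return the sorted ids of examples that no model solved."""
    all_ids = set()
    solved = set()
    for per_example in results.values():
        all_ids.update(per_example)  # every id this model was evaluated on
        solved.update(ex for ex, ok in per_example.items() if ok)
    return sorted(all_ids - solved)


# Toy data, not taken from the benchmark.
toy = {
    "model_a": {1: True, 2: False, 3: False},
    "model_b": {1: False, 2: False, 3: True},
}
print(unsolved_examples(toy))  # [2]
```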

Problems solved by 1 model only

example_link model min_pass1_of_model
148 qwen3-32b 0.268
447 qwen3-32b 0.268
441 qwen3-32b 0.268
391 qwen3-32b 0.268
448 qwen3-32b 0.268
209 qwen3-32b 0.268
133 qwen3-14b 0.258
174 qwen3-14b 0.258
112 qwen3-4b 0.223
150 qwen3-4b 0.223
237 qwen2-72b-instruct 0.221
273 qwen2.5-coder-32b-instruct 0.212
213 qwen2.5-coder-32b-instruct 0.212
211 qwen2.5-coder-32b-instruct 0.212
202 qwen3-8b 0.188
197 google_gemma_2_27b_it 0.132
132 google_gemma_3_4b_it 0.115
199 google_gemma_3_4b_it 0.115
196 google_gemma_2_9b_it 0.091
511 google_gemma_2_9b_it 0.091
433 qwen1.5-14b-chat 0.086
460 qwen3-1.7b 0.070
486 llama-3.1-8B-instruct 0.066
277 google_codegemma_1.1_7b_it 0.063
480 mistralai_mistral_7b_instruct_v0.2 0.062
374 mistralai_mistral_7b_instruct_v0.3 0.061
384 qwen2-7b-instruct 0.060
252 qwen2-7b-instruct 0.060
188 qwen2-1.5b-instruct 0.056
259 qwen2-1.5b-instruct 0.056
383 qwen2-math-7b-instruct 0.056
396 deepseek_r1_distill_qwen_14b 0.051
216 qwen2.5-coder-3b-instruct 0.048
468 google_gemma_2b_it 0.045
368 qwen3-0.6b 0.032
390 qwen2-0.5b-instruct 0.031
490 deepseek_r1_distill_llama_8b 0.019
364 qwen2.5-coder-1.5b-instruct 0.018
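The table above pairs each such problem with its single solver and that model's overall pass@1. A minimal sketch of how rows like these might be produced, assuming the same hypothetical `results[model][example_id]` layout plus a hypothetical `pass1` mapping from model name to overall pass@1 (both are assumptions for illustration):

```python
from typing import Dict, List, Tuple


def solved_by_one(
    results: Dict[str, Dict[int, bool]],
    pass1: Dict[str, float],
) -> List[Tuple[int, str, float]]:
    """Return (example_id, model, model_pass1) for examples with one solver."""
    solvers: Dict[int, List[str]] = {}
    for model, per_example in results.items():
        for ex, ok in per_example.items():
            if ok:
                solvers.setdefault(ex, []).append(model)
    rows = [
        (ex, models[0], pass1[models[0]])
        for ex, models in solvers.items()
        if len(models) == 1
    ]
    # Strongest solver first, then by example id for a stable order.
    rows.sort(key=lambda r: (-r[2], r[0]))
    return rows


# Toy data, not taken from the benchmark.
toy_results = {
    "strong": {1: True, 2: True, 3: False},
    "weak": {1: True, 2: False, 3: True},
}
toy_pass1 = {"strong": 0.6, "weak": 0.3}
print(solved_by_one(toy_results, toy_pass1))
# [(2, 'strong', 0.6), (3, 'weak', 0.3)]
```

With exactly one solver, the minimum pass@1 among solvers is simply that model's pass@1, which is how this sketch reads the `min_pass1_of_model` column.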

Suspect problems

These are the 10 problems whose outcomes correlate least with the overall evaluation (i.e., better models tend to do worse on them).

example_link pass1_of_ex tau
364 0.005 -0.194
490 0.005 -0.169
390 0.004 -0.143
368 0.004 -0.135
30 0.077 -0.132
494 0.104 -0.118
48 0.070 -0.104
28 0.175 -0.103
192 0.021 -0.101
119 0.012 -0.096
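The page does not say which correlation statistic the `tau` column uses. As one plausible reading, the sketch below computes Kendall's tau-b between each model's overall pass@1 and whether that model solved a given problem; the choice of tau-b and the function name are assumptions. A negative value means stronger models solve the problem less often.

```python
import math
from itertools import combinations
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b over paired observations (handles ties)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else 0.0


# Toy data: four models ordered by overall pass@1; only the two weakest
# solved this problem, so the correlation is negative.
overall_pass1 = [0.1, 0.2, 0.3, 0.4]
solved = [1, 1, 0, 0]
print(round(kendall_tau_b(overall_pass1, solved), 3))  # -0.816
```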

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
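The binning behind such a histogram can be sketched as follows, assuming each problem's accuracy is a value in [0, 1] (like the `pass1_of_ex` column) and equal-width bins. The bin count of 10 and the function name are assumptions, not details from the page.

```python
from typing import List, Sequence


def accuracy_histogram(accuracies: Sequence[float], n_bins: int = 10) -> List[int]:
    """Count problems per equal-width accuracy bin over [0, 1]."""
    counts = [0] * n_bins
    for acc in accuracies:
        # Clamp so that an accuracy of exactly 1.0 lands in the last bin.
        index = min(int(acc * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Toy per-problem accuracies, for illustration only.
toy_accuracies = [0.005, 0.077, 0.104, 0.175, 1.0]
print(accuracy_histogram(toy_accuracies))
# [2, 2, 0, 0, 0, 0, 0, 0, 0, 1]
```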

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.