bbh_cot: by examples



Not solved by any model

There are 76 examples that no model solved. If these are well-posed problems, solving some of them is a good signal that your model is genuinely better than the leading models.
949, 950, 956, 984, 1003, 1008, 1010, 1011, 1012, 1024, 1025, 1032, 1039, 1049, 1057, 1072, 1082, 1107, 1111, 1121, 1123, 1132, 1135, 1137, 1146, 1149, 1152, 1162, 1164, 1172, 1174, 2850, 4182, 4227, 6262, 6267, 6272, 6276, 6277, 6285, 6286, 6294, 6296, 6298, 6310, 6316, 6318, 6319, 6328, 6335, 6342, 6347, 6350, 6351, 6354, 6362, 6366, 6367, 6376, 6377, 6387, 6391, 6398, 6401, 6418, 6430, 6433, 6434, 6438, 6442, 6447, 6457, 6461, 6476, 6494, 6507
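
A list like the one above could be derived as in the following minimal sketch. It assumes a hypothetical `results` mapping from model name to per-example pass/fail flags. That layout, the function name, and the toy data are illustrative assumptions, not the page's actual data format.

```python
def unsolved_examples(results):
    """Return the sorted example ids that no model solved.

    `results` is a hypothetical mapping: model name -> {example id: bool}.
    """
    all_ids = set()
    solved = set()
    for per_example in results.values():
        all_ids.update(per_example)
        solved.update(ex for ex, ok in per_example.items() if ok)
    return sorted(all_ids - solved)


# Toy data for illustration only, not taken from the benchmark.
toy_results = {
    "model_a": {1003: False, 1092: True, 6363: False},
    "model_b": {1003: False, 1092: False, 6363: True},
}
print(unsolved_examples(toy_results))  # [1003]
```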

Problems solved by 1 model only

example_link model min_pass1_of_model
1092 google_gemma_3_12b_it 0.818
1077 google_gemma_3_12b_it 0.818
6332 google_gemma_3_12b_it 0.818
6358 google_gemma_3_12b_it 0.818
6326 google_gemma_3_12b_it 0.818
6330 google_gemma_3_12b_it 0.818
6407 google_gemma_3_12b_it 0.818
6363 qwen2-72b-instruct 0.790
6497 qwen2-72b-instruct 0.790
1059 qwen2-72b-instruct 0.790
965 qwen2-72b-instruct 0.790
6341 qwen2-72b-instruct 0.790
6273 qwen2-72b-instruct 0.790
6305 qwen2-72b-instruct 0.790
6375 qwen2-72b-instruct 0.790
6440 qwen2-72b-instruct 0.790
6426 qwen2-72b-instruct 0.790
6459 qwen2-72b-instruct 0.790
6449 qwen2-72b-instruct 0.790
6410 qwen2-72b-instruct 0.790
6502 qwen2.5-coder-32b-instruct 0.783
6388 qwen2.5-coder-32b-instruct 0.783
6488 qwen2.5-coder-32b-instruct 0.783
6480 qwen2.5-coder-32b-instruct 0.783
6331 qwen2.5-coder-32b-instruct 0.783
6340 qwen2.5-coder-32b-instruct 0.783
6322 qwen2.5-coder-32b-instruct 0.783
6320 qwen2.5-coder-32b-instruct 0.783
6448 qwen2.5-coder-32b-instruct 0.783
1048 qwen3-32b 0.762
1126 qwen2.5-coder-14b-instruct 0.714
1097 qwen2.5-coder-14b-instruct 0.714
6450 qwen1.5-72b-chat 0.683
6452 google_gemma_3_4b_it 0.644
976 google_gemma_3_4b_it 0.644
6475 mistralai_mathstral_7b_v0.1 0.622
939 mistralai_mathstral_7b_v0.1 0.622
1046 qwen2.5-coder-7b-instruct 0.599
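
The table above could be built along these lines. This sketch assumes that `min_pass1_of_model` is the overall pass@1 of the weakest model that solves the example (with a single solver, simply that model's overall score). The column's exact definition is not stated on the page, and `results`, the function name, and the toy data are hypothetical.

```python
def solved_by_one_model(results):
    """Return (example id, model, overall pass@1) rows for examples that
    exactly one model solved, sorted by descending overall pass@1.

    `results` is a hypothetical mapping: model name -> {example id: bool}.
    """
    # Overall pass@1 per model: fraction of its examples it solved.
    overall = {
        model: sum(per_example.values()) / len(per_example)
        for model, per_example in results.items()
        if per_example
    }
    # Collect the models that solved each example.
    solvers = {}
    for model, per_example in results.items():
        for ex, ok in per_example.items():
            if ok:
                solvers.setdefault(ex, []).append(model)
    rows = [
        (ex, models[0], overall[models[0]])
        for ex, models in solvers.items()
        if len(models) == 1
    ]
    return sorted(rows, key=lambda row: -row[2])


# Toy data for illustration only.
toy = {
    "strong": {1: True, 2: True, 3: False, 4: True},
    "weak": {1: True, 2: False, 3: True, 4: False},
}
for row in solved_by_one_model(toy):
    print(row)
```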

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on them).

example_link pass1_of_ex tau
4415 0.046 -0.396
4123 0.105 -0.322
4832 0.136 -0.320
857 0.080 -0.293
439 0.071 -0.271
4561 0.011 -0.260
4420 0.027 -0.256
4557 0.020 -0.253
734 0.165 -0.245
260 0.202 -0.236
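
As a rough illustration of how a per-problem correlation like `tau` might be computed, the sketch below assumes the column is a Kendall rank correlation (the tau-b variant, which corrects for ties) between each model's overall score and whether that model solved the problem. The page does not say which statistic it uses, so treat both the choice of tau-b and the toy inputs as assumptions.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length numeric sequences."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: ignored by tau-b
        elif dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else 0.0


# Toy data: the two weakest models solve the problem, the two strongest do not.
model_scores = [0.2, 0.4, 0.6, 0.8]  # hypothetical overall pass@1 per model
solved = [1, 1, 0, 0]                # 1 if that model solved the problem
print(round(kendall_tau_b(model_scores, solved), 3))  # -0.816
```

A strongly negative value, as in the toy case, flags a problem that weaker models solve more often than stronger ones, which may indicate a mislabeled or ambiguous item.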

Histogram of accuracies

Histogram of problems by the accuracy on each problem.

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.
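
Both histograms bin a per-problem value between 0 and 1 (the accuracy on the problem, or the minimum win rate needed to solve it). A minimal sketch of that binning, assuming equal-width bins, is shown here. The bin count and the input values are illustrative assumptions, not the page's actual settings or data.

```python
def histogram(values, bins=10):
    """Count values in [0, 1] per equal-width bin; 1.0 goes in the last bin."""
    counts = [0] * bins
    for v in values:
        idx = min(int(v * bins), bins - 1)
        counts[idx] += 1
    return counts


# Toy per-problem accuracies for illustration only.
toy_accuracies = [0.05, 0.15, 0.15, 0.95, 1.0]
counts = histogram(toy_accuracies)
for i, n in enumerate(counts):
    # One text row per bin, e.g. "0.1-0.2 ##".
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} {'#' * n}")
```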