bbh_cot: by examples



Not solved by any model

There are 132 examples that no model solved. Provided these problems are well-posed, solving some of them is a good signal that your model is genuinely better than the leading models.
1003, 1008, 1010, 1011, 1012, 1024, 1025, 1031, 1032, 1039, 1043, 1046, 1049, 1057, 1058, 1059, 1076, 1077, 1082, 1090, 1092, 1111, 1121, 1123, 1126, 1131, 1132, 1135, 1136, 1137, 1146, 1149, 1152, 1162, 1163, 1164, 1172, 1174, 2850, 4182, 4227, 4551, 6262, 6272, 6273, 6276, 6277, 6285, 6286, 6289, 6294, 6296, 6305, 6310, 6316, 6318, 6319, 6320, 6322, 6323, 6328, 6330, 6331, 6335, 6337, 6340, 6341, 6342, 6347, 6348, 6350, 6351, 6362, 6363, 6365, 6366, 6367, 6375, 6376, 6377, 6380, 6387, 6391, 6398, 6399, 6401, 6404, 6407, 6410, 6418, 6426, 6430, 6431, 6433, 6434, 6438, 6440, 6442, 6443, 6444, 6447, 6448, 6450, 6452, 6457, 6458, 6459, 6461, 6464, 6475, 6476, 6480, 6482, 6485, 6488, 6490, 6494, 6496, 6498, 6499, 6502, 6507, 6508, 939, 949, 950, 951, 956, 965, 978, 984, 991
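
A list like the one above could be derived from a per-model, per-example results table. The following is a minimal sketch under stated assumptions: the `results` layout, the model names, and the example IDs are all hypothetical toy values, not the data behind this page.

```python
# Hypothetical toy data: results[model][example_id] is True if the model solved it.
results = {
    "model_a": {1: False, 2: True, 3: False},
    "model_b": {1: False, 2: False, 3: False},
}

def unsolved_examples(results):
    """Return the sorted IDs of examples that no model solved."""
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)
    return sorted(
        ex for ex in example_ids
        if not any(per_example.get(ex, False) for per_example in results.values())
    )

print(unsolved_examples(results))
```

An example missing from a model's results is treated as unsolved by that model, which is a design choice of this sketch.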

Problems solved by 1 model only

example_link model min_pass1_of_model
6298 google_gemma_3_12b_it 0.815
6356 google_gemma_3_12b_it 0.815
6354 google_gemma_3_12b_it 0.815
6301 google_gemma_3_12b_it 0.815
6332 google_gemma_3_12b_it 0.815
1632 google_gemma_3_12b_it 0.815
6405 google_gemma_3_12b_it 0.815
6501 google_gemma_3_12b_it 0.815
6470 google_gemma_3_12b_it 0.815
6412 google_gemma_3_12b_it 0.815
6378 google_gemma_3_12b_it 0.815
6358 google_gemma_3_12b_it 0.815
6372 google_gemma_3_12b_it 0.815
6419 llama-3.1-70B-instruct 0.812
6484 llama-3.1-70B-instruct 0.812
6352 llama-3.1-70B-instruct 0.812
6326 llama-3.1-70B-instruct 0.812
6388 qwen2.5-coder-32b-instruct 0.748
6460 qwen2.5-coder-32b-instruct 0.748
1048 qwen3-32b 0.733
1575 qwen3-32b 0.733
1108 qwen3-32b 0.733
1064 mistralai_mixtral_8x22b_instruct_v0.1 0.723
1107 qwen2-72b-instruct 0.718
6449 qwen2-72b-instruct 0.718
1173 qwen2-math-72b-instruct 0.686
6339 qwen2.5-coder-14b-instruct 0.637
6267 qwen2.5-coder-14b-instruct 0.637
1097 qwen2.5-coder-14b-instruct 0.637
6497 qwen1.5-32b-chat 0.609
6429 qwen1.5-32b-chat 0.609
1085 mistralai_mixtral_8x7b_instruct_v0.1 0.583
6425 mistralai_mathstral_7b_v0.1 0.530
1015 mistralai_mathstral_7b_v0.1 0.530
976 mistralai_mathstral_7b_v0.1 0.530
1072 qwen2-math-7b-instruct 0.519
1183 qwen2.5-coder-1.5b-instruct 0.287
1485 qwen2.5-coder-0.5b-instruct 0.217
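
A table of this shape could be built as sketched below. The sketch assumes that the third column is the overall pass@1 of the single solving model, which the page does not state explicitly. The `results` and `overall_pass1` structures and all values are hypothetical.

```python
# Hypothetical toy data, not the values reported on this page.
results = {
    "strong": {1: True, 2: False, 3: True},
    "weak": {1: True, 2: True, 3: False},
}
overall_pass1 = {"strong": 0.8, "weak": 0.3}  # assumed overall score per model

def solved_by_one_model(results, overall_pass1):
    """Return (example_id, model, score) rows for examples with exactly one solver."""
    solvers = {}
    for model, per_example in results.items():
        for ex, solved in per_example.items():
            if solved:
                solvers.setdefault(ex, []).append(model)
    rows = [
        (ex, models[0], overall_pass1[models[0]])
        for ex, models in solvers.items()
        if len(models) == 1
    ]
    # Strongest solving model first, then by example ID.
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows

for row in solved_by_one_model(results, overall_pass1):
    print(*row)
```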

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (the tau column). A negative tau means that better models tend to do worse on the problem.

example_link pass1_of_ex tau
4832 0.174 -0.345
4700 0.113 -0.267
734 0.130 -0.263
260 0.196 -0.258
4641 0.087 -0.246
601 0.087 -0.239
378 0.141 -0.232
4561 0.019 -0.225
383 0.201 -0.201
1548 0.059 -0.199
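
To illustrate how a negative tau arises, here is a minimal pure-Python sketch of a rank correlation between the models' overall scores and their outcomes on one problem. The page does not say which tau variant it uses. Kendall's tau-b, which corrects for ties, is an assumption here, and the input values are hypothetical.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else 0.0

# Hypothetical: four models ordered by overall score; only the two weakest solve the problem.
model_scores = [0.2, 0.4, 0.6, 0.8]
solved = [1, 1, 0, 0]
tau = kendall_tau_b(model_scores, solved)
print(round(tau, 3))
```

In this toy case every untied pair is discordant, so tau is strongly negative, which is the pattern that would flag a problem as suspect.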

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
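
The binning behind such a histogram might look like the sketch below. It assumes per-problem accuracy is the fraction of models that solved the problem. The `results` layout and values are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: results[model][example_id] is True if solved.
results = {
    "model_a": {1: True, 2: True, 3: False},
    "model_b": {1: True, 2: False, 3: False},
}

def accuracy_histogram(results, n_bins=10):
    """Count problems per equal-width accuracy bin over [0, 1]."""
    attempts, solved = Counter(), Counter()
    for per_example in results.values():
        for ex, ok in per_example.items():
            attempts[ex] += 1
            solved[ex] += int(ok)
    hist = [0] * n_bins
    for ex in attempts:
        accuracy = solved[ex] / attempts[ex]
        # An accuracy of exactly 1.0 goes into the last bin.
        hist[min(int(accuracy * n_bins), n_bins - 1)] += 1
    return hist

print(accuracy_histogram(results))
```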

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.
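
One way to read "minimum win rate to solve" is the lowest overall win rate among the models that solved a given problem. The sketch below computes that value per problem. This interpretation, the `win_rate` mapping, and all values are assumptions made for illustration.

```python
# Hypothetical toy data, not the values behind this page.
results = {
    "strong": {1: True, 2: False, 3: True, 4: False},
    "weak": {1: True, 2: True, 3: False, 4: False},
}
win_rate = {"strong": 0.8, "weak": 0.3}  # assumed overall win rate per model

def min_win_rate_to_solve(results, win_rate):
    """Map each solved example to the lowest win rate among its solvers."""
    difficulty = {}
    for model, per_example in results.items():
        for ex, ok in per_example.items():
            if ok:
                difficulty[ex] = min(difficulty.get(ex, float("inf")), win_rate[model])
    return difficulty

print(min_win_rate_to_solve(results, win_rate))
```

Problems that no model solved (example 4 here) have no entry, so a histogram over these values would need to handle them separately.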