bbh_cot: by examples


Not solved by any model

There are 132 examples that no model solved. If these are well-posed problems, solving some of them is a good signal that your model is indeed better than the leading models.
1003, 1008, 1010, 1011, 1012, 1024, 1025, 1031, 1032, 1039, 1043, 1046, 1049, 1057, 1058, 1059, 1076, 1077, 1082, 1090, 1092, 1111, 1121, 1123, 1126, 1131, 1132, 1135, 1136, 1137, 1146, 1149, 1152, 1162, 1163, 1164, 1172, 1174, 2850, 4182, 4227, 4551, 6262, 6272, 6273, 6276, 6277, 6285, 6286, 6289, 6294, 6296, 6305, 6310, 6316, 6318, 6319, 6320, 6322, 6323, 6328, 6330, 6331, 6335, 6337, 6340, 6341, 6342, 6347, 6348, 6350, 6351, 6362, 6363, 6365, 6366, 6367, 6375, 6376, 6377, 6380, 6387, 6391, 6398, 6399, 6401, 6404, 6407, 6410, 6418, 6426, 6430, 6431, 6433, 6434, 6438, 6440, 6442, 6443, 6444, 6447, 6448, 6450, 6452, 6457, 6458, 6459, 6461, 6464, 6475, 6476, 6480, 6482, 6485, 6488, 6490, 6494, 6496, 6498, 6499, 6502, 6507, 6508, 939, 949, 950, 951, 956, 965, 978, 984, 991
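A list like the one above could be derived from per-model results roughly as follows. This is a minimal sketch, not the code behind this page: the `results` layout (model name mapped to per-example solved flags) and the function name `unsolved_by_all` are assumptions made for illustration.

```python
def unsolved_by_all(results):
    """Return example ids that no model solved.

    results: dict mapping model name -> dict of example id -> bool (solved).
    This layout is a hypothetical one chosen for the sketch.
    """
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)
    # An example counts as unsolved if every model either failed it
    # or has no recorded result for it.
    return sorted(
        ex for ex in example_ids
        if not any(per_example.get(ex, False) for per_example in results.values())
    )


# Toy data, not taken from the benchmark.
toy_results = {
    "model_a": {"e1": True, "e2": False, "e3": False},
    "model_b": {"e1": False, "e2": False, "e3": True},
}
unsolved = unsolved_by_all(toy_results)
```

In the toy data only `e2` is failed by both models, so it is the only id returned.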

Problems solved by 1 model only

example_link	model	min_pass1_of_model
6298	google_gemma_3_12b_it	0.815
6356	google_gemma_3_12b_it	0.815
6354	google_gemma_3_12b_it	0.815
6301	google_gemma_3_12b_it	0.815
6332	google_gemma_3_12b_it	0.815
1632	google_gemma_3_12b_it	0.815
6405	google_gemma_3_12b_it	0.815
6501	google_gemma_3_12b_it	0.815
6470	google_gemma_3_12b_it	0.815
6412	google_gemma_3_12b_it	0.815
6378	google_gemma_3_12b_it	0.815
6358	google_gemma_3_12b_it	0.815
6372	google_gemma_3_12b_it	0.815
6419	llama-3.1-70B-instruct	0.812
6484	llama-3.1-70B-instruct	0.812
6352	llama-3.1-70B-instruct	0.812
6326	llama-3.1-70B-instruct	0.812
6388	qwen2.5-coder-32b-instruct	0.748
6460	qwen2.5-coder-32b-instruct	0.748
1048	qwen3-32b	0.733
1575	qwen3-32b	0.733
1108	qwen3-32b	0.733
1064	mistralai_mixtral_8x22b_instruct_v0.1	0.723
1107	qwen2-72b-instruct	0.718
6449	qwen2-72b-instruct	0.718
1173	qwen2-math-72b-instruct	0.686
6339	qwen2.5-coder-14b-instruct	0.637
6267	qwen2.5-coder-14b-instruct	0.637
1097	qwen2.5-coder-14b-instruct	0.637
6497	qwen1.5-32b-chat	0.609
6429	qwen1.5-32b-chat	0.609
1085	mistralai_mixtral_8x7b_instruct_v0.1	0.583
6425	mistralai_mathstral_7b_v0.1	0.530
1015	mistralai_mathstral_7b_v0.1	0.530
976	mistralai_mathstral_7b_v0.1	0.530
1072	qwen2-math-7b-instruct	0.519
1183	qwen2.5-coder-1.5b-instruct	0.287
1485	qwen2.5-coder-0.5b-instruct	0.217
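The table above can be read as: for each example with exactly one solver, list that model and its overall pass@1. The sketch below follows that reading, which is an assumption about the `min_pass1_of_model` column rather than something the page states. The names `solved_by_one_model`, `results`, and `model_pass1` are hypothetical.

```python
def solved_by_one_model(results, model_pass1):
    """List (example id, model, model pass@1) for examples with a single solver.

    results: dict mapping model name -> dict of example id -> bool (solved).
    model_pass1: dict mapping model name -> overall pass@1 of that model.
    Both layouts are assumptions made for this sketch.
    """
    solvers = {}
    for model, per_example in results.items():
        for ex, solved in per_example.items():
            if solved:
                solvers.setdefault(ex, []).append(model)
    rows = [
        (ex, models[0], model_pass1[models[0]])
        for ex, models in solvers.items()
        if len(models) == 1
    ]
    # Strongest model first, matching the ordering of the table above.
    rows.sort(key=lambda row: -row[2])
    return rows


# Toy data, not taken from the benchmark.
toy_results = {
    "model_a": {"e1": True, "e2": True, "e3": False},
    "model_b": {"e1": True, "e2": False, "e3": True},
}
toy_pass1 = {"model_a": 0.8, "model_b": 0.6}
rows = solved_by_one_model(toy_results, toy_pass1)
```

A row attributed to a weak model is worth a closer look: if only a low-scoring model solves an example, a lucky guess is a plausible explanation.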

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).

example_link	pass1_of_ex	tau
4832	0.174	-0.345
4700	0.113	-0.267
734	0.130	-0.263
260	0.196	-0.258
4641	0.087	-0.246
601	0.087	-0.239
378	0.141	-0.232
4561	0.019	-0.225
383	0.201	-0.201
1548	0.059	-0.199
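One way to obtain a per-example correlation like `tau` is a rank correlation between each model's overall score and whether that model solved the example. The sketch below assumes `tau` means Kendall's tau and uses the tau-b variant, which handles the many ties that binary outcomes produce. The page does not confirm either choice, and the function name `kendall_tau_b` is hypothetical.

```python
import math


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences.

    xs: overall score of each model (assumed input).
    ys: 1 if that model solved the example, else 0 (assumed input).
    Returns NaN when either sequence is constant.
    """
    concordant = discordant = 0
    untied_x = untied_y = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx != 0:
                untied_x += 1
            if dy != 0:
                untied_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    denom = math.sqrt(untied_x * untied_y)
    if denom == 0:
        return float("nan")
    return (concordant - discordant) / denom


# Toy case: the two weakest models solve the example, the two strongest fail.
scores = [0.2, 0.4, 0.6, 0.8]
solved = [1, 1, 0, 0]
tau = kendall_tau_b(scores, solved)
```

A negative value, as in the toy case, means stronger models fail the example more often than weaker ones, which is the pattern the table flags as suspect.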

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
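The binning behind such a histogram can be sketched as follows, assuming each problem's accuracy is the fraction of models that solved it (the `pass1_of_ex` column above). The function name `accuracy_histogram` and the choice of 10 equal-width bins are assumptions for illustration.

```python
def accuracy_histogram(accuracies, n_bins=10):
    """Count problems per equal-width accuracy bin over [0, 1].

    accuracies: per-problem accuracy values in [0, 1].
    The last bin is closed on the right so that 1.0 is included.
    """
    counts = [0] * n_bins
    for acc in accuracies:
        idx = min(int(acc * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# pass1_of_ex values of the 10 suspect problems listed above.
suspect_pass1 = [0.174, 0.113, 0.130, 0.196, 0.087,
                 0.087, 0.141, 0.019, 0.201, 0.059]
counts = accuracy_histogram(suspect_pass1)
```

Applied to the suspect problems, every value falls in the three lowest bins, which fits the fact that all ten have accuracy at or below about 0.2.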

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.