bbh_cot: by examples


Not solved by any model

There are 76 examples that no model solves. If these problems are well-posed, solving some of them is a good signal that your model is genuinely better than the leading models.
1003, 1008, 1010, 1011, 1012, 1024, 1025, 1032, 1039, 1049, 1057, 1072, 1082, 1107, 1111, 1121, 1123, 1132, 1135, 1137, 1146, 1149, 1152, 1162, 1164, 1172, 1174, 2850, 4182, 4227, 6262, 6267, 6272, 6276, 6277, 6285, 6286, 6294, 6296, 6298, 6310, 6316, 6318, 6319, 6328, 6335, 6342, 6347, 6350, 6351, 6354, 6362, 6366, 6367, 6376, 6377, 6387, 6391, 6398, 6401, 6418, 6430, 6433, 6434, 6438, 6442, 6447, 6457, 6461, 6476, 6494, 6507, 949, 950, 956, 984
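
As a rough illustration of how such a list could be derived, the sketch below assumes a hypothetical `results` mapping of the form `results[model][example_id] = pass@1` and treats a score of 0 (or a missing entry) as "not solved". Neither the layout nor the function name comes from this page.

```python
from typing import Dict, List


def unsolved_examples(results: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the IDs of examples on which every model scores 0.

    `results` is a hypothetical layout: results[model][example_id] = pass@1.
    A missing entry is treated as a score of 0 (an assumption).
    """
    all_ids = set()
    for per_example in results.values():
        all_ids.update(per_example)
    # IDs are strings, so sorting is lexicographic ("949" sorts after "6507"),
    # which is the ordering the list above uses.
    return sorted(
        ex
        for ex in all_ids
        if all(per_example.get(ex, 0.0) == 0.0 for per_example in results.values())
    )
```

Calling `len(unsolved_examples(results))` on the full result set would then give the count quoted above, provided the assumptions about the data layout hold.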

Problems solved by only one model

example_link	model	min_pass1_of_model
1092	google_gemma_3_12b_it	0.818
1077	google_gemma_3_12b_it	0.818
6332	google_gemma_3_12b_it	0.818
6358	google_gemma_3_12b_it	0.818
6326	google_gemma_3_12b_it	0.818
6330	google_gemma_3_12b_it	0.818
6407	google_gemma_3_12b_it	0.818
6363	qwen2-72b-instruct	0.790
6497	qwen2-72b-instruct	0.790
1059	qwen2-72b-instruct	0.790
965	qwen2-72b-instruct	0.790
6341	qwen2-72b-instruct	0.790
6273	qwen2-72b-instruct	0.790
6305	qwen2-72b-instruct	0.790
6375	qwen2-72b-instruct	0.790
6440	qwen2-72b-instruct	0.790
6426	qwen2-72b-instruct	0.790
6459	qwen2-72b-instruct	0.790
6449	qwen2-72b-instruct	0.790
6410	qwen2-72b-instruct	0.790
6502	qwen2.5-coder-32b-instruct	0.783
6388	qwen2.5-coder-32b-instruct	0.783
6488	qwen2.5-coder-32b-instruct	0.783
6480	qwen2.5-coder-32b-instruct	0.783
6331	qwen2.5-coder-32b-instruct	0.783
6340	qwen2.5-coder-32b-instruct	0.783
6322	qwen2.5-coder-32b-instruct	0.783
6320	qwen2.5-coder-32b-instruct	0.783
6448	qwen2.5-coder-32b-instruct	0.783
1048	qwen3-32b	0.762
1126	qwen2.5-coder-14b-instruct	0.714
1097	qwen2.5-coder-14b-instruct	0.714
6450	qwen1.5-72b-chat	0.683
6452	google_gemma_3_4b_it	0.644
976	google_gemma_3_4b_it	0.644
6475	mistralai_mathstral_7b_v0.1	0.622
939	mistralai_mathstral_7b_v0.1	0.622
1046	qwen2.5-coder-7b-instruct	0.599
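
The table above could be produced along these lines. This is a minimal sketch: the `results` layout (`results[model][example_id] = pass@1`) is hypothetical, and because the third column is constant per model, the sketch assumes `min_pass1_of_model` is that model's overall pass@1.

```python
from typing import Dict, List, Tuple

Results = Dict[str, Dict[str, float]]


def overall_pass1(results: Results) -> Dict[str, float]:
    """Mean pass@1 of each model over the examples it was run on."""
    return {
        model: sum(per_example.values()) / len(per_example)
        for model, per_example in results.items()
        if per_example
    }


def solved_by_one_model(results: Results) -> List[Tuple[str, str, float]]:
    """Return (example_id, model, overall pass@1) for single-solver examples."""
    overall = overall_pass1(results)
    all_ids = {ex for per_example in results.values() for ex in per_example}
    rows = []
    for ex in all_ids:
        # A model "solves" an example if its score there is above 0 (assumption).
        solvers = [m for m, per in results.items() if per.get(ex, 0.0) > 0.0]
        if len(solvers) == 1:
            rows.append((ex, solvers[0], overall[solvers[0]]))
    # Strongest sole solver first, as in the table above.
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows
```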

Suspect problems

These are the 10 problems with the lowest (most negative) correlation with the overall evaluation, i.e. better models tend to do worse on them.

example_link	pass1_of_ex	tau
4415	0.046	-0.396
4123	0.105	-0.322
4832	0.136	-0.320
857	0.080	-0.293
439	0.071	-0.271
4561	0.011	-0.260
4420	0.027	-0.256
4557	0.020	-0.253
734	0.165	-0.245
260	0.202	-0.236
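
The page does not say how `tau` is computed. One plausible reading is Kendall's rank correlation between each model's overall score and its score on the single example; the sketch below implements the tau-b variant under that assumption, using only the standard library. The function and argument names are illustrative.

```python
import math
from itertools import combinations
from typing import Sequence


def kendall_tau_b(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan).

    Assumed usage: xs = overall pass@1 per model, ys = that model's pass@1
    on one example. Returns 0.0 when either sequence is constant.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else 0.0
```

A negative value means that models with a higher overall score tend to score lower on that example, which is the pattern the table flags as suspect.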

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
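
A minimal sketch of the binning behind such a histogram, assuming each problem's accuracy is a value in [0, 1] (for example the `pass1_of_ex` column). The bin count and function name are illustrative choices, not taken from this page.

```python
from typing import Iterable, List


def accuracy_histogram(accuracies: Iterable[float], num_bins: int = 10) -> List[int]:
    """Count problems per equal-width accuracy bin over [0, 1].

    Bin i covers [i / num_bins, (i + 1) / num_bins); an accuracy of exactly
    1.0 is placed in the last bin.
    """
    counts = [0] * num_bins
    for acc in accuracies:
        if not 0.0 <= acc <= 1.0:
            raise ValueError(f"accuracy out of range: {acc}")
        index = min(int(acc * num_bins), num_bins - 1)
        counts[index] += 1
    return counts
```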

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.
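
The term "minimum win rate" is not defined on this page. One reading, used in the sketch below, is the overall score of the weakest model that still solves the problem; both that interpretation and the data layout (`results[model][example_id] = pass@1`, `overall[model] = overall score`) are assumptions.

```python
from typing import Dict


def min_solver_score(
    results: Dict[str, Dict[str, float]],
    overall: Dict[str, float],
) -> Dict[str, float]:
    """Map each solved example to the lowest overall score among its solvers.

    Examples that no model solves are omitted from the result.
    """
    difficulty: Dict[str, float] = {}
    for model, per_example in results.items():
        for ex, score in per_example.items():
            if score > 0.0:  # the model solves this example (assumption)
                current = difficulty.get(ex, float("inf"))
                difficulty[ex] = min(current, overall[model])
    return difficulty
```

Under this reading, a problem that only strong models solve gets a high value, and a problem that even weak models solve gets a low one.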