mgsm_cot: by examples


Not solved by any model

No model solved the following 10 examples. If the problems themselves are sound, solving some of them is a good signal that your model is genuinely better than the leading models.
1128, 1139, 1295, 1350, 1389, 1459, 1499, 1628, 2323, 337
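
A list like the one above can be derived from per-model results. The sketch below is only illustrative: it assumes a hypothetical `results` mapping of model name to `{example_id: solved}` booleans, which is not a format given on this page.

```python
def unsolved_examples(results):
    """Return the sorted ids of examples that no model solved.

    `results` is a hypothetical structure (an assumption, not the page's
    actual data format): {model_name: {example_id: bool}}.
    """
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)
    return sorted(
        ex for ex in example_ids
        # An example missing from a model's results counts as not solved.
        if not any(per_example.get(ex, False) for per_example in results.values())
    )
```

Treating a missing example as unsolved is a design choice made here for simplicity; a stricter version might raise an error instead.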

Problems solved by 1 model only

example_link	model	min_pass1_of_model
1594	llama-3.1-70B-instruct	0.861
2062	qwen2.5-coder-14b-instruct	0.688
2337	qwen1.5-32b-chat	0.592
1087	qwen2-7b-instruct	0.570
255	qwen2-7b-instruct	0.570
1110	qwen1.5-7b-chat	0.319
1832	qwen1.5-7b-chat	0.319
1284	mistralai_mistral_7b_instruct_v0.1	0.177
2554	qwen2-1.5b-instruct	0.146
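
A table of this shape could be built roughly as follows. This is a sketch under assumptions: `results` is the same hypothetical `{model_name: {example_id: solved}}` mapping, and `min_pass1_of_model` is read here as the overall pass@1 of the single model that solved the example, which matches the repeated values per model in the table but is not confirmed by the page.

```python
def solved_by_one_model(results):
    """Return (example_id, model, model_pass1) rows for examples with one solver.

    Assumes `results` is {model_name: {example_id: bool}} (hypothetical format).
    Rows are sorted by the solver's overall pass@1, highest first.
    """
    # Overall pass@1 per model: fraction of its examples marked solved.
    pass1 = {m: sum(per.values()) / len(per) for m, per in results.items()}

    # Collect the models that solved each example.
    solvers = {}
    for model, per in results.items():
        for ex, ok in per.items():
            if ok:
                solvers.setdefault(ex, []).append(model)

    rows = [
        (ex, models[0], pass1[models[0]])
        for ex, models in solvers.items()
        if len(models) == 1
    ]
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows
```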

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (i.e., better models tend to do worse on them).

example_link	pass1_of_ex	tau
2542	0.195	-0.422
1327	0.048	-0.410
2634	0.021	-0.353
2589	0.207	-0.340
1339	0.133	-0.337
2339	0.109	-0.329
345	0.026	-0.320
370	0.036	-0.309
843	0.013	-0.291
1589	0.132	-0.289
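
The page does not say how `tau` is computed. One plausible reading, assumed here, is Kendall's tau-b between each model's overall pass@1 and whether that model solved the example. The sketch below implements that reading in pure Python; `results` is again a hypothetical `{model_name: {example_id: solved}}` mapping.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b for two equal-length sequences (O(n^2) pair scan)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: counted in neither tie total
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0


def suspect_examples(results, k=10):
    """Return the k (example_id, tau) pairs with the lowest tau.

    Assumption: tau compares model overall pass@1 with per-example success.
    """
    pass1 = {m: sum(per.values()) / len(per) for m, per in results.items()}
    models = sorted(results)
    example_ids = sorted({ex for per in results.values() for ex in per})
    xs = [pass1[m] for m in models]
    scored = []
    for ex in example_ids:
        ys = [1 if results[m].get(ex, False) else 0 for m in models]
        scored.append((ex, kendall_tau_b(xs, ys)))
    scored.sort(key=lambda row: row[1])
    return scored[:k]
```

A strongly negative tau means the weaker models solve the example while the stronger ones fail, which is why such problems are worth checking for wrong reference answers or ambiguous wording.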

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
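
As a rough illustration of how such a histogram could be binned, the sketch below counts problems per accuracy bucket. It assumes per-problem accuracy is a value in [0, 1] (as with `pass1_of_ex` above); the function name and the choice of 10 equal-width bins are illustrative, not taken from the page.

```python
def accuracy_histogram(pass1_by_example, bins=10):
    """Count examples per equal-width accuracy bin over [0, 1].

    `pass1_by_example` is a hypothetical {example_id: accuracy} mapping.
    An accuracy of exactly 1.0 is placed in the last bin.
    """
    counts = [0] * bins
    for accuracy in pass1_by_example.values():
        index = min(int(accuracy * bins), bins - 1)
        counts[index] += 1
    return counts


# Usage with the ten "suspect" rows listed above (not the full dataset).
suspect_pass1 = {
    2542: 0.195, 1327: 0.048, 2634: 0.021, 2589: 0.207, 1339: 0.133,
    2339: 0.109, 345: 0.026, 370: 0.036, 843: 0.013, 1589: 0.132,
}
suspect_counts = accuracy_histogram(suspect_pass1)
```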

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.