mgsm_cot: by examples


Not solved by any model

22 examples were not solved by any model. Provided these are sound problems, solving some of them is a good signal that your model is genuinely better than the leading models.
337, 359, 1110, 1284, 1295, 1312, 1337, 1350, 1389, 1421, 1459, 1587, 1594, 1600, 1628, 1832, 2062, 2087, 2226, 2323, 2563, 2661
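
A list like the one above could be derived from per-model results roughly as follows. This is a minimal sketch, not the page's actual pipeline: the `results` layout (model name mapped to per-example pass@1), the function name, and the toy data are all assumptions made for illustration.

```python
def unsolved_examples(results):
    """Return the example IDs that no model solved.

    `results` is a hypothetical layout: model name -> {example_id: pass@1 in [0, 1]}.
    An example counts as unsolved when every model scores 0 on it
    (a missing entry is treated as 0).
    """
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)
    return sorted(
        ex for ex in example_ids
        if all(per_example.get(ex, 0.0) == 0.0 for per_example in results.values())
    )


# Toy data, not taken from the real evaluation.
toy_results = {
    "model_a": {1: 1.0, 2: 0.0, 3: 0.0},
    "model_b": {1: 0.0, 2: 0.5, 3: 0.0},
}
print(unsolved_examples(toy_results))  # prints [3]
```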

Problems solved by 1 model only

example_link	model	min_pass1_of_model
1959	deepseek_r1_distill_llama_70b	0.858
2115	qwen3-14b	0.844
1869	qwen2-72b-instruct	0.794
2153	deepseek_r1_distill_qwen_32b	0.794
2086	qwen2.5-coder-32b-instruct	0.773
1128	llama-3.1-70B-instruct	0.770
1812	qwen3-4b	0.737
2012	google_gemma_3_4b_it	0.712
2337	google_gemma_3_4b_it	0.712
1934	qwen1.5-32b-chat	0.586
1316	mistralai_ministral_8b_instruct_2410	0.510
1087	qwen1.5-14b-chat	0.435
2554	google_gemma_3_1b_it	0.260
1499	google_gemma_7b_it	0.178
255	mistralai_mistral_7b_instruct_v0.1	0.121
1239	qwen2-1.5b-instruct	0.082
1139	qwen2.5-coder-0.5b-instruct	0.035
1014	qwen2-0.5b-instruct	0.030
2540	qwen1.5-0.5b-chat	0.014
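
The table above can be approximated with a sketch like the one below. It assumes that `min_pass1_of_model` is the overall pass@1 of the single model that solved the example (the value repeats for rows with the same model, which is consistent with that reading, but the page does not define it). The `results` layout, function names, and toy data are hypothetical.

```python
def overall_pass1(per_example):
    """Mean pass@1 of one model over all of its examples."""
    return sum(per_example.values()) / len(per_example)


def solved_by_one_model(results):
    """Return {example_id: (model, overall pass@1 of that model)}
    for examples that exactly one model solved.

    `results` is a hypothetical layout: model name -> {example_id: pass@1}.
    """
    solvers = {}
    for model, per_example in results.items():
        for ex, score in per_example.items():
            if score > 0.0:
                solvers.setdefault(ex, []).append(model)
    return {
        ex: (models[0], overall_pass1(results[models[0]]))
        for ex, models in sorted(solvers.items())
        if len(models) == 1
    }


# Toy data, not taken from the real evaluation.
toy_results = {
    "model_a": {1: 1.0, 2: 0.0, 3: 0.0},
    "model_b": {1: 1.0, 2: 1.0, 3: 0.0},
}
for ex, (model, score) in solved_by_one_model(toy_results).items():
    print(ex, model, round(score, 3))
```

In the toy data only example 2 has a single solver; example 1 is solved by both models and example 3 by none.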

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e. better models tend to do worse on them.

example_link	pass1_of_ex	tau
1268	0.016	-0.278
2685	0.266	-0.264
1327	0.026	-0.249
2345	0.051	-0.244
2634	0.042	-0.240
2339	0.106	-0.238
1020	0.047	-0.236
964	0.049	-0.233
2542	0.221	-0.229
2540	0.010	-0.198
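
The page does not say how tau is computed. The sketch below assumes it is Kendall's tau (the tau-b variant, which tolerates ties) between each model's overall score and its result on a single problem. The function name and the toy numbers are illustrative only.

```python
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both factors
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Toy data: overall score of four models, and whether each solved one problem.
overall_scores = [0.2, 0.5, 0.8, 0.9]
solved_problem = [1, 1, 0, 0]  # the two weaker models solve it, the stronger ones fail
tau = kendall_tau_b(overall_scores, solved_problem)
print(tau < 0)  # prints True: a "suspect" pattern under this assumption
```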

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
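
For reference, the binning behind such a histogram might look like the sketch below. The bin count and the toy accuracies are assumptions; the real plot's settings are not given on the page.

```python
def accuracy_histogram(accuracies, bins=10):
    """Count problems per equal-width accuracy bin over [0, 1].

    An accuracy of exactly 1.0 is placed in the last bin.
    """
    counts = [0] * bins
    for acc in accuracies:
        idx = min(int(acc * bins), bins - 1)
        counts[idx] += 1
    return counts


# Toy per-problem accuracies (fraction of models solving each problem).
toy_accuracies = [0.0, 0.016, 0.266, 0.5, 1.0]
print(accuracy_histogram(toy_accuracies, bins=4))  # prints [2, 1, 1, 1]
```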

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.