mgsm_cot: by examples


Not solved by any model

There are 10 examples not solved by any model. Provided these problems are well-formed, solving some of them is a good signal that your model is genuinely stronger than the leading models.
1128, 1139, 1295, 1350, 1389, 1459, 1499, 1628, 2323, 337
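
A minimal sketch of how such a list could be derived, assuming a hypothetical `results` mapping from model name to per-example pass@1 scores. The data layout, model names, and values are illustrative assumptions, not the page's actual pipeline.

```python
def unsolved_examples(results):
    """Return the sorted ids of examples that no model solves (score 0 everywhere)."""
    example_ids = set()
    for scores in results.values():
        example_ids.update(scores)
    return sorted(
        ex for ex in example_ids
        if all(scores.get(ex, 0.0) == 0.0 for scores in results.values())
    )


# Hypothetical toy data: model -> {example id -> pass@1 on that example}
results = {
    "model_a": {1: 1.0, 2: 0.0, 3: 0.0},
    "model_b": {1: 0.0, 2: 0.5, 3: 0.0},
}
print(unsolved_examples(results))  # [3]
```

Treating a missing score as 0 is a design choice of this sketch; a stricter version could require every model to have been evaluated on every example.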

Problems solved by 1 model only

example_link  model                               min_pass1_of_model
1594          llama-3.1-70B-instruct              0.861
2062          qwen2.5-coder-14b-instruct          0.688
2337          qwen1.5-32b-chat                    0.592
1087          qwen2-7b-instruct                   0.570
255           qwen2-7b-instruct                   0.570
1110          qwen1.5-7b-chat                     0.319
1832          qwen1.5-7b-chat                     0.319
1284          mistralai_mistral_7b_instruct_v0.1  0.177
2554          qwen2-1.5b-instruct                 0.146
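
A table of this shape could be built roughly as follows. This is a sketch under two assumptions: that "solved" means a per-example score above 0, and that `min_pass1_of_model` is the overall pass@1 of the solving model (suggested by the same value appearing on every row for a given model, but not confirmed by the page). The `results` layout and model names are hypothetical.

```python
def solved_by_one(results):
    """Return (example_id, model, model_pass1) rows for examples solved by exactly one model."""
    # Overall pass@1 per model: mean of its per-example scores.
    overall = {m: sum(s.values()) / len(s) for m, s in results.items()}
    example_ids = sorted({ex for s in results.values() for ex in s})
    rows = []
    for ex in example_ids:
        solvers = [m for m, s in results.items() if s.get(ex, 0.0) > 0.0]
        if len(solvers) == 1:
            rows.append((ex, solvers[0], overall[solvers[0]]))
    # Strongest solving model first; the sort is stable, so ties keep id order.
    rows.sort(key=lambda r: -r[2])
    return rows


# Hypothetical toy data: model -> {example id -> pass@1 on that example}
results = {
    "model_a": {1: 1.0, 2: 1.0, 3: 1.0, 4: 0.0},
    "model_b": {1: 1.0, 2: 0.0, 3: 0.0, 4: 1.0},
}
for row in solved_by_one(results):
    print(row)
```

A problem solved only by a weak model is more surprising than one solved only by the strongest model, which is why the solving model's score is kept alongside each row.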

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e. better models tend to do worse on them.

example_link  pass1_of_ex  tau
2542          0.195        -0.422
1327          0.048        -0.410
2634          0.021        -0.353
2589          0.207        -0.340
1339          0.133        -0.337
2339          0.109        -0.329
345           0.026        -0.320
370           0.036        -0.309
843           0.013        -0.291
1589          0.132        -0.289
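
To illustrate how a negative tau can arise, here is a sketch that assumes `tau` is Kendall's rank correlation between each model's overall score and its result on a single problem. The page does not state which variant it uses; the tau-b form below is chosen here because binary outcomes produce many ties. All inputs are hypothetical.

```python
import math


def kendall_tau_b(xs, ys):
    """Kendall's tau-b over paired observations, with tie correction."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from every count
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical data: four models ordered from weakest to strongest.
overall_scores = [0.2, 0.4, 0.6, 0.8]
# Outcome of each model on one problem: only the two weakest models solve it.
solved = [1, 1, 0, 0]
tau = kendall_tau_b(overall_scores, solved)
print(round(tau, 3))  # negative: stronger models do worse on this problem
```

A strongly negative value on a real problem is a hint to inspect it for a wrong reference answer or ambiguous wording, rather than proof that it is broken.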

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
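
The binning behind such a histogram can be sketched as follows. The function name and the choice of ten equal-width bins are assumptions. The input reuses the `pass1_of_ex` values of the ten suspect problems listed above, so the counts describe only those problems, not the full benchmark.

```python
def accuracy_histogram(pass1_by_example, num_bins=10):
    """Count examples per equal-width accuracy bin over [0, 1]."""
    counts = [0] * num_bins
    for acc in pass1_by_example.values():
        # An accuracy of exactly 1.0 falls into the last bin.
        idx = min(int(acc * num_bins), num_bins - 1)
        counts[idx] += 1
    return counts


# pass1_of_ex values taken from the suspect-problems table.
pass1_by_example = {
    2542: 0.195, 1327: 0.048, 2634: 0.021, 2589: 0.207, 1339: 0.133,
    2339: 0.109, 345: 0.026, 370: 0.036, 843: 0.013, 1589: 0.132,
}
counts = accuracy_histogram(pass1_by_example)
for i, count in enumerate(counts):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f}: {'#' * count}")
```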

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.