gsm8k_cot: by examples

Not solved by any model

There is 1 example not solved by any model. If it is a well-posed problem, solving it can be a good signal that your model is indeed better than the leading models.
1042

Problems solved by 1 model only

example_link model min_pass1_of_model
782 qwen2.5-coder-14b-instruct 0.810
823 qwen2-1.5b-instruct 0.205
454 qwen2.5-coder-0.5b-instruct 0.098
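
The two lists above (problems solved by no model, and problems solved by exactly one model) can be derived from a per-example, per-model pass@1 table. The following is a minimal sketch, not the page's actual code: the `pass1` layout, the function name `group_by_solver_count`, the model names, and the rule that "solved" means pass@1 above a threshold of 0 are all assumptions made for illustration.

```python
def group_by_solver_count(pass1, threshold=0.0):
    """Split examples into 'unsolved' and 'solved by exactly one model'.

    pass1: dict mapping example id -> dict mapping model name -> pass@1.
    threshold: a model counts as solving an example when its pass@1
               is strictly greater than this value (assumed rule).
    """
    unsolved = []        # examples no model solves
    single_solver = {}   # example id -> the only model that solves it
    for example_id, per_model in pass1.items():
        solvers = [m for m, p in per_model.items() if p > threshold]
        if not solvers:
            unsolved.append(example_id)
        elif len(solvers) == 1:
            single_solver[example_id] = solvers[0]
    return unsolved, single_solver


# Toy data with hypothetical example ids and model names.
toy_pass1 = {
    1: {"model_a": 0.0, "model_b": 0.0},   # solved by nobody
    2: {"model_a": 0.3, "model_b": 0.0},   # solved by model_a only
    3: {"model_a": 0.9, "model_b": 0.7},   # solved by both
}
unsolved, single_solver = group_by_solver_count(toy_pass1)
```

Raising `threshold` gives a stricter notion of "solved", which may matter when pass@1 is estimated from only a few samples.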

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on them).

example_link pass1_of_ex tau
749 0.033 -0.284
454 0.004 -0.173
823 0.005 -0.158
962 0.068 -0.122
1048 0.024 -0.112
403 0.072 -0.071
119 0.038 -0.057
1016 0.086 -0.056
952 0.073 -0.045
12 0.049 -0.023
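
The `tau` column can be read as a rank correlation between how each model does on one problem and how it does overall. The sketch below assumes `tau` is Kendall's tau (the tau-b variant, which handles ties); the page does not confirm this, and the function name and toy numbers are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences of numbers."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue              # tied in both: ignored by tau-b
        elif dx == 0:
            ties_x += 1           # tied in x only
        elif dy == 0:
            ties_y += 1           # tied in y only
        elif dx * dy > 0:
            concordant += 1       # pair ordered the same way in x and y
        else:
            discordant += 1       # pair ordered oppositely
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical data: overall accuracy of four models, and their
# pass@1 on a single problem where stronger models do worse.
overall_accuracy = [0.2, 0.4, 0.6, 0.8]
problem_pass1 = [1.0, 0.5, 0.5, 0.0]
tau = kendall_tau_b(overall_accuracy, problem_pass1)
```

A negative value, as in this toy case, means the ordering of models on the problem runs against their overall ordering, which is the pattern the table flags as suspect.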

Histogram of accuracies

Histogram of problems by the accuracy on each problem.

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.