gsm8k_cot: by examples



Not solved by any model

There are 2 examples that no model solves. Provided these are well-posed problems, solving some of them is a good signal that your model is better than the leading models.
454, 823
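A list like the one above can be derived from per-model results. The following is a minimal sketch, assuming a hypothetical `pass1` mapping from model name to per-example pass@1; the model names, example ids, and scores are made up for illustration and are not taken from this page.

```python
# Hypothetical data: pass1[model][example_id] = fraction of samples that solved it.
pass1 = {
    "model_a": {101: 0.0, 102: 0.9, 103: 0.0},
    "model_b": {101: 0.0, 102: 0.4, 103: 0.2},
}

def unsolved_by_all(pass1):
    """Return the example ids whose pass@1 is zero for every model."""
    # Collect every example id that appears for at least one model.
    example_ids = set().union(*(scores.keys() for scores in pass1.values()))
    return sorted(
        ex for ex in example_ids
        # A missing score is treated as "not solved" (an assumption).
        if all(scores.get(ex, 0.0) == 0.0 for scores in pass1.values())
    )

print(unsolved_by_all(pass1))  # [101]
```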

Problems solved by 1 model only

example_link model min_pass1_of_model
1042 google_gemma_2b_it 0.099
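Examples with a single solver can be found in a similar way. This is a sketch under the same assumption of a hypothetical `pass1` mapping from model name to per-example pass@1; it treats any score above zero as "solved", which may differ from the threshold used to build the table above.

```python
def solved_by_one_model(pass1):
    """Map example id -> name of the only model with pass@1 > 0 on it."""
    solvers = {}
    for model, scores in pass1.items():
        for ex, p in scores.items():
            if p > 0.0:
                solvers.setdefault(ex, []).append(model)
    # Keep only the examples that exactly one model solves.
    return {ex: models[0] for ex, models in sorted(solvers.items()) if len(models) == 1}

# Hypothetical data, not taken from this page.
pass1 = {
    "model_a": {101: 0.0, 102: 0.9, 103: 0.0},
    "model_b": {101: 0.0, 102: 0.4, 103: 0.2},
}
print(solved_by_one_model(pass1))  # {103: 'model_b'}
```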

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on them).

example_link pass1_of_ex tau
749 0.063 -0.373
962 0.084 -0.253
1309 0.026 -0.210
835 0.059 -0.206
952 0.116 -0.197
1042 0.002 -0.188
403 0.103 -0.077
89 0.384 -0.069
1048 0.021 -0.068
368 0.013 -0.045
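The page does not state how tau is computed. One plausible reading, offered here only as an assumption, is Kendall's tau between each model's overall score and its pass@1 on the single example. The sketch below implements the tau-b variant in plain Python; `kendall_tau_b` and the sample numbers are illustrative, not the page's actual method or data.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation between two equal-length sequences."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    # If either sequence is constant the correlation is undefined; return 0.0.
    return (concordant - discordant) / denom if denom else 0.0

# Hypothetical scores for four models: overall accuracy vs. pass@1 on one example.
overall = [0.2, 0.4, 0.6, 0.8]
on_example = [0.9, 0.5, 0.3, 0.0]
print(kendall_tau_b(overall, on_example))  # -1.0
```

In this made-up case every stronger model does worse on the example, so tau reaches its minimum of -1. The values in the table above are much closer to zero.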

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
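A histogram of this kind can be built by binning per-problem accuracies. This is a minimal sketch with equal-width bins over [0, 1]; the function name, the bin count, and the sample values are assumptions for illustration only.

```python
def accuracy_histogram(accuracies, n_bins=10):
    """Count problems per equal-width accuracy bin over [0, 1]."""
    counts = [0] * n_bins
    for acc in accuracies:
        # An accuracy of exactly 1.0 falls into the last bin.
        idx = min(int(acc * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical per-problem accuracies.
print(accuracy_histogram([0.0, 0.05, 0.5, 0.95, 1.0]))
# [2, 0, 0, 0, 0, 1, 0, 0, 0, 2]
```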

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.