gsm8k_cot: by examples

Not solved by any model

There are 2 examples that no model solved. If these problems are well-posed, solving some of them is a good signal that your model is genuinely better than the leading models.
454, 823
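
As a rough illustration of how such a list could be derived, the sketch below scans a table of per-example pass@1 scores and keeps the examples on which every model scores zero. The data layout, the model names, and the helper `unsolved_examples` are assumptions made for this example, not the code behind this page.

```python
# Hypothetical pass@1 scores: model name -> {example id -> pass@1 in [0, 1]}.
# Model names and numbers are made up for illustration.
pass1 = {
    "model_a": {1: 1.0, 2: 0.0, 3: 0.5},
    "model_b": {1: 0.8, 2: 0.0, 3: 0.0},
}


def unsolved_examples(pass1):
    """Return sorted ids of examples on which every model has pass@1 == 0."""
    example_ids = set()
    for scores in pass1.values():
        example_ids.update(scores)
    return sorted(
        ex
        for ex in example_ids
        # A missing score is treated as a failure (an assumption).
        if all(scores.get(ex, 0.0) == 0.0 for scores in pass1.values())
    )


unsolved = unsolved_examples(pass1)
print(unsolved)  # [2]
```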

Problems solved by 1 model only

example_link	model	min_pass1_of_model
1042	google_gemma_2b_it	0.099
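
A table like this one could be built along the following lines. This is only a sketch: it assumes that `min_pass1_of_model` is the overall pass@1 of the single model that solves the example, which the page does not state explicitly. The data and the helpers `overall_pass1` and `solved_by_one_model` are hypothetical.

```python
# Hypothetical pass@1 scores: model name -> {example id -> pass@1 in [0, 1]}.
pass1 = {
    "weak_model": {1: 0.2, 2: 0.0, 3: 0.0},
    "mid_model": {1: 0.0, 2: 0.0, 3: 0.5},
    "strong_model": {1: 0.0, 2: 0.9, 3: 0.8},
}


def overall_pass1(scores):
    """Mean pass@1 of one model across all of its examples."""
    return sum(scores.values()) / len(scores)


def solved_by_one_model(pass1):
    """Return (example, model, overall pass@1) rows for examples with exactly one solver."""
    rows = []
    example_ids = sorted({ex for scores in pass1.values() for ex in scores})
    for ex in example_ids:
        # A model "solves" an example if its pass@1 there is above zero (an assumption).
        solvers = [m for m, scores in pass1.items() if scores.get(ex, 0.0) > 0.0]
        if len(solvers) == 1:
            model = solvers[0]
            rows.append((ex, model, round(overall_pass1(pass1[model]), 3)))
    return rows


rows = solved_by_one_model(pass1)
for ex, model, score in rows:
    print(f"{ex}\t{model}\t{score}")
```

An example that is solved only by a weak model, as in the row above, is worth inspecting by hand, since it may point to a flawed problem or a lucky answer rather than real capability.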

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e. better models tend to do worse on them.

example_link	pass1_of_ex	tau
749	0.063	-0.373
962	0.084	-0.253
1309	0.026	-0.210
835	0.059	-0.206
952	0.116	-0.197
1042	0.002	-0.188
403	0.103	-0.077
89	0.384	-0.069
1048	0.021	-0.068
368	0.013	-0.045
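
To make the tau column more concrete, here is a minimal sketch of a rank correlation between each model's overall score and its score on one example. It assumes tau means Kendall's tau-b; the page does not say which variant is used. The function `kendall_tau_b` and the sample numbers are illustrative only.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    # tau is undefined when one of the sequences is constant.
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical data: one entry per model, ordered from weakest to strongest.
overall_scores = [0.1, 0.3, 0.5, 0.7]  # overall pass@1 of each model
example_scores = [1.0, 1.0, 0.0, 0.0]  # pass@1 of each model on one example

tau = kendall_tau_b(overall_scores, example_scores)
print(round(tau, 3))  # -0.816
```

A strongly negative value, as in this toy case, means the weaker models solve the example while the stronger ones fail it, which is the pattern the table flags as suspect.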

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
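
The binning behind such a histogram can be sketched as follows. The helper `histogram` is hypothetical, and the input is only the `pass1_of_ex` column of the suspect-problems table above, so the counts describe those 10 problems rather than the full dataset.

```python
def histogram(values, n_bins=10):
    """Count values in [0, 1] in n_bins equal-width bins; 1.0 goes in the last bin."""
    counts = [0] * n_bins
    for v in values:
        idx = min(int(v * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# pass1_of_ex values copied from the suspect-problems table.
accuracies = [0.063, 0.084, 0.026, 0.059, 0.116, 0.002, 0.103, 0.384, 0.021, 0.013]

counts = histogram(accuracies)
for i, count in enumerate(counts):
    # One text bar per bin, e.g. "0.0-0.1 #######".
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} {'#' * count}")
```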

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.