gsm8k_cot: by examples


Not solved by any model

There is 1 example not solved by any model. Solving it can be a good signal that your model is indeed better than leading models, provided it is a good problem.
1042

Problems solved by 1 model only

example_link	model	min_pass1_of_model
782	qwen2.5-coder-14b-instruct	0.810
823	qwen2-1.5b-instruct	0.205
454	qwen2.5-coder-0.5b-instruct	0.098
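The two lists above can be derived from a per-model, per-example pass@1 table. The following is a minimal sketch, not the page's actual pipeline: the `results` layout, the `solved_by` helper, the model names, and the rule that an example counts as solved when pass@1 is above 0 are all assumptions made for illustration.

```python
from collections import defaultdict

def solved_by(results, threshold=0.0):
    """Map each example id to the sorted list of models that solve it.

    `results` is a hypothetical layout: {model_name: {example_id: pass1}}.
    An example counts as solved when pass1 > threshold (an assumption).
    """
    solvers = defaultdict(list)
    for model, per_example in results.items():
        for example_id, pass1 in per_example.items():
            solvers[example_id]  # make sure unsolved examples still get a key
            if pass1 > threshold:
                solvers[example_id].append(model)
    return {ex: sorted(models) for ex, models in solvers.items()}

# Made-up data, for illustration only.
results = {
    "model_a": {1: 1.0, 2: 0.0, 3: 0.0},
    "model_b": {1: 0.5, 2: 0.0, 3: 0.0},
    "model_c": {1: 0.0, 2: 0.0, 3: 0.2},
}

solvers = solved_by(results)
# Examples that no model solves.
unsolved = sorted(ex for ex, models in solvers.items() if not models)
# Examples solved by exactly one model, mapped to that model.
single = {ex: models[0] for ex, models in solvers.items() if len(models) == 1}

print(unsolved)
print(single)
```

Raising `threshold` would tighten the definition of "solved" if sampled runs make a small nonzero pass@1 too noisy to count.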

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).

example_link	pass1_of_ex	tau
749	0.033	-0.284
454	0.004	-0.173
823	0.005	-0.158
962	0.068	-0.122
1048	0.024	-0.112
403	0.072	-0.071
119	0.038	-0.057
1016	0.086	-0.056
952	0.073	-0.045
12	0.049	-0.023
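A negative tau means that a problem's per-model pass rate moves against the models' overall scores. The page does not state which rank correlation it uses; the sketch below assumes Kendall's tau-b, which handles the many ties that occur when most models score 0 on a problem. The function name and the sample numbers are hypothetical.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pairs)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue            # tied in both variables: excluded from both factors
        elif dx == 0:
            ties_x += 1         # tied only in x
        elif dy == 0:
            ties_y += 1         # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0

# Made-up data: overall accuracy of four models, and their pass@1 on one problem.
overall_accuracy = [0.2, 0.4, 0.6, 0.8]
pass1_on_problem = [1.0, 0.5, 0.5, 0.0]

tau = kendall_tau_b(overall_accuracy, pass1_on_problem)
print(round(tau, 3))  # negative: stronger models do worse on this problem
```

Problems with a strongly negative tau are worth inspecting by hand, since a wrong reference answer or an ambiguous question is one possible cause.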

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
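As a rough illustration of how such a histogram could be built, the sketch below bins per-problem accuracies into equal-width buckets over [0, 1]. The helper name, the bin count, and the sample accuracies are assumptions, not values taken from this page.

```python
def accuracy_histogram(accuracies, n_bins=4):
    """Count problems per equal-width accuracy bin over [0, 1].

    An accuracy of exactly 1.0 is placed in the last bin.
    """
    counts = [0] * n_bins
    for acc in accuracies:
        index = min(int(acc * n_bins), n_bins - 1)
        counts[index] += 1
    return counts

# Made-up per-problem accuracies (fraction of models solving each problem).
per_problem_accuracy = [0.0, 0.05, 0.5, 0.95, 1.0]
counts = accuracy_histogram(per_problem_accuracy)
print(counts)  # [2, 0, 1, 2]
```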

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.