ap_cot: by examples


Not solved by any model

There are 0 examples that no model solved. If the problems are sound, solving any such example is a good signal that your model is indeed better than the leading models.
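As a rough illustration of how such a list might be produced, the sketch below tallies how many models solve each problem. The `pass1` table layout, the model names, and the rule that "solved" means pass@1 strictly above a threshold are all assumptions made for this example, not the report's actual pipeline. The same tally also yields the problems solved by exactly one model.

```python
from typing import Dict, List, Tuple


def split_by_solver_count(
    pass1: Dict[str, Dict[int, float]], threshold: float = 0.0
) -> Tuple[List[int], List[Tuple[int, str]]]:
    """Return (unsolved problems, [(problem, sole solver)]).

    pass1[model][problem] is a hypothetical pass@1 table; a model "solves"
    a problem when its pass@1 is strictly above `threshold` (an assumption).
    """
    problems = sorted({p for scores in pass1.values() for p in scores})
    unsolved: List[int] = []
    single: List[Tuple[int, str]] = []
    for p in problems:
        solvers = [m for m, scores in pass1.items() if scores.get(p, 0.0) > threshold]
        if not solvers:
            unsolved.append(p)
        elif len(solvers) == 1:
            single.append((p, solvers[0]))
    return unsolved, single


# Made-up scores for illustration only.
toy = {
    "model_a": {1: 0.9, 2: 0.0, 3: 0.0},
    "model_b": {1: 0.4, 2: 0.2, 3: 0.0},
}
unsolved, single = split_by_solver_count(toy)
```

In this toy table, problem 3 is unsolved and problem 2 is solved only by `model_b`.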

Problems solved by 1 model only

example_link	model	min_pass1_of_model

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on them).

example_link	pass1_of_ex	tau
590	0.125	-0.448
568	0.246	-0.444
703	0.149	-0.392
330	0.131	-0.357
598	0.057	-0.328
195	0.046	-0.246
320	0.147	-0.241
67	0.051	-0.210
683	0.019	-0.165
274	0.084	-0.140
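A negative `tau` can be reproduced with a small rank-correlation sketch. It assumes that `tau` is a Kendall rank correlation between each model's overall score and its pass@1 on the single problem; the report does not state which variant it uses, so the tau-a form below (no tie correction) is only one plausible reading, and all numbers are made up.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    score = 0
    for i, j in combinations(range(len(x)), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie.
        score += (prod > 0) - (prod < 0)
    n_pairs = len(x) * (len(x) - 1) // 2
    return score / n_pairs


# Hypothetical data: overall accuracy of four models and their pass@1 on one problem.
overall = [0.30, 0.45, 0.60, 0.75]
on_problem = [0.40, 0.35, 0.10, 0.05]
tau = kendall_tau_a(overall, on_problem)
```

Here every stronger model does worse on the problem, so every pair is discordant and `tau` reaches its minimum of -1. Values such as -0.448 in the table would correspond to a weaker version of the same pattern.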

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
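A minimal sketch of the binning behind such a histogram, assuming ten equal-width bins over [0, 1]; the bin count and the function name are choices made for this example, not taken from the report. The input reuses the `pass1_of_ex` values from the suspect-problems table above.

```python
from typing import List, Sequence


def accuracy_histogram(accuracies: Sequence[float], n_bins: int = 10) -> List[int]:
    """Count values in n_bins equal-width bins over [0, 1]; 1.0 goes in the last bin."""
    counts = [0] * n_bins
    for a in accuracies:
        if not 0.0 <= a <= 1.0:
            raise ValueError(f"accuracy out of range: {a}")
        counts[min(int(a * n_bins), n_bins - 1)] += 1
    return counts


# pass1_of_ex values of the ten suspect problems listed earlier.
suspect = [0.125, 0.246, 0.149, 0.131, 0.057, 0.046, 0.147, 0.051, 0.019, 0.084]
counts = accuracy_histogram(suspect)
```

All ten suspect problems fall in the three lowest bins (accuracy below 0.3).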

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.