ap_cot: by examples


Not solved by any model

There are 0 examples that no model solved. When such examples exist and the problems are sound, solving some of them is a good signal that your model is indeed better than the leading models.

Problems solved by 1 model only

example_link	model	min_pass1_of_model
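
The two lists above can be derived from a per-model, per-problem results table. A minimal sketch, assuming a hypothetical `results` mapping of model name to `{problem_id: pass1}` and treating a problem as solved when pass1 is above zero (this page does not state the actual threshold):

```python
def solver_counts(results):
    """Map each problem id to the list of models that solved it.

    `results` is a hypothetical structure: {model_name: {problem_id: pass1}}.
    A problem counts as solved when pass1 > 0 (an assumed threshold).
    """
    solvers = {}
    for model, per_problem in results.items():
        for problem, pass1 in per_problem.items():
            solvers.setdefault(problem, [])
            if pass1 > 0:
                solvers[problem].append(model)
    return solvers


def unsolved_and_single(results):
    """Return (problems no model solved, {problem: its only solver})."""
    solvers = solver_counts(results)
    unsolved = sorted(p for p, models in solvers.items() if not models)
    single = {p: models[0] for p, models in solvers.items() if len(models) == 1}
    return unsolved, single


# Made-up data for illustration only; not taken from the benchmark.
example = {
    "model_a": {1: 1.0, 2: 0.0, 3: 0.0},
    "model_b": {1: 0.5, 2: 0.2, 3: 0.0},
}
unsolved, single = unsolved_and_single(example)
print(unsolved)  # [3]
print(single)    # {2: 'model_b'}
```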

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on them).

example_link	pass1_of_ex	tau
568	0.258	-0.509
598	0.050	-0.470
590	0.130	-0.458
703	0.104	-0.338
558	0.068	-0.299
67	0.035	-0.289
683	0.013	-0.288
195	0.079	-0.215
172	0.170	-0.176
274	0.113	-0.163
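
The page does not say how tau is computed. One plausible reading is Kendall's tau between each model's overall score and its pass1 on the given problem, so that a negative value means stronger models do worse on it. A minimal pure-Python sketch of the tau-b variant under that assumption (the function name and the sample numbers are illustrative, not from the benchmark):

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences, with tie handling."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(xs, ys)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both denominators
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif (dx > 0) == (dy > 0):
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    denom = sqrt((pairs + tied_x_only) * (pairs + tied_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Made-up example: four models ordered by overall score, and their pass1
# on one problem that gets harder as the models get better.
overall_score = [0.2, 0.4, 0.6, 0.8]
problem_pass1 = [0.9, 0.5, 0.3, 0.0]
tau = kendall_tau_b(overall_score, problem_pass1)
print(tau)  # -1.0
```

A problem whose tau is strongly negative, like those at the top of the table, is worth inspecting by hand for a wrong reference answer or an ambiguous statement.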

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
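
A histogram of this kind can be built by bucketing each problem's pass1 into equal-width bins. The sketch below is an illustration only: the bin count of 10 is an assumption, and the input is just the ten pass1_of_ex values from the suspect-problems table, not the full problem set.

```python
def accuracy_histogram(pass1_values, n_bins=10):
    """Count how many problems fall into each equal-width pass1 bin on [0, 1]."""
    counts = [0] * n_bins
    for p in pass1_values:
        # A pass1 of exactly 1.0 goes into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# pass1_of_ex values copied from the suspect-problems table.
suspect_pass1 = [0.258, 0.050, 0.130, 0.104, 0.068,
                 0.035, 0.013, 0.079, 0.170, 0.113]
counts = accuracy_histogram(suspect_pass1)

# Text rendering: one row per bin, one '#' per problem.
n_bins = len(counts)
for i, c in enumerate(counts):
    print(f"{i / n_bins:.1f}-{(i + 1) / n_bins:.1f} {'#' * c}")
```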

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.
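
The page does not define "minimum win rate to solve" precisely. One reading is: for each problem, the lowest win rate among the models that solve it, so that problems solved only by strong models count as difficult. A sketch under that assumption, with hypothetical `results` (`{model: {problem_id: pass1}}`) and `win_rate` (`{model: float}`) inputs and an assumed "pass1 > 0" definition of solved:

```python
def min_win_rate_to_solve(results, win_rate):
    """For each problem, the lowest win rate among models that solved it.

    Problems that no model solved map to None.
    """
    difficulty = {}
    for model, per_problem in results.items():
        for problem, pass1 in per_problem.items():
            if pass1 > 0:  # assumed definition of "solved"
                current = difficulty.get(problem)
                if current is None or win_rate[model] < current:
                    difficulty[problem] = win_rate[model]
            else:
                difficulty.setdefault(problem, None)
    return difficulty


# Made-up data for illustration only; not taken from the benchmark.
win_rate = {"weak": 0.2, "strong": 0.8}
results = {
    "weak": {1: 1.0, 2: 0.0, 3: 0.0},
    "strong": {1: 1.0, 2: 0.6, 3: 0.0},
}
difficulty = min_win_rate_to_solve(results, win_rate)
print(difficulty)  # {1: 0.2, 2: 0.8, 3: None}
```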