gpqa_cot: by examples



Not solved by any model

There are 0 examples that no model solved. When such examples exist, solving some of them can be a good signal that your model is better than the leading models, provided the problems themselves are sound.

Problems solved by 1 model only

example_link model min_pass1_of_model

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).

example_link pass1_of_ex tau
88 0.122 -0.421
25 0.173 -0.397
57 0.035 -0.394
245 0.068 -0.387
250 0.039 -0.380
170 0.013 -0.318
181 0.052 -0.310
121 0.018 -0.307
226 0.096 -0.307
202 0.093 -0.295
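
As a rough illustration of how a negative tau can arise, the sketch below computes a rank correlation between each model's overall accuracy and its pass@1 on a single problem. It assumes that tau here means Kendall's tau, which the page does not state, and it uses the simple tau-a variant (ties count as neither concordant nor discordant). The function name `kendall_tau_a` and the four-model data are hypothetical, not taken from the leaderboard.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs.

    Tied pairs contribute zero to the numerator. This is an assumed
    reading of the "tau" column, not the page's confirmed method.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    score = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            score += 1  # concordant pair: both orderings agree
        elif product < 0:
            score -= 1  # discordant pair: orderings disagree
    n = len(xs)
    return score / (n * (n - 1) / 2)


# Hypothetical data: four models, ordered from weakest to strongest overall.
overall_accuracy = [0.30, 0.45, 0.60, 0.75]
# Hypothetical pass@1 of the same four models on one suspect problem.
pass1_on_problem = [0.40, 0.20, 0.10, 0.00]

tau = kendall_tau_a(overall_accuracy, pass1_on_problem)
print(tau)
```

In this made-up case every stronger model does worse on the problem, so every pair is discordant and tau reaches its minimum. The values in the table above are milder (around -0.3 to -0.4), which under this reading would mean the ordering is only partly inverted.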

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
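
A histogram like this can be built by binning per-problem pass@1 values. The sketch below is only an illustration: it bins the ten pass1_of_ex values from the suspect-problems table, not the full benchmark, and the bin count of 10 and the helper name `bin_accuracies` are assumptions.

```python
def bin_accuracies(values, n_bins=10):
    """Count values in [0, 1] into n_bins equal-width bins.

    A value of exactly 1.0 is placed in the last bin.
    """
    counts = [0] * n_bins
    for v in values:
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"accuracy out of range: {v}")
        index = min(int(v * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# pass1_of_ex values of the ten suspect problems listed in the table.
suspect_pass1 = [0.122, 0.173, 0.035, 0.068, 0.039,
                 0.013, 0.052, 0.018, 0.096, 0.093]

counts = bin_accuracies(suspect_pass1)
for i, count in enumerate(counts):
    low, high = i / 10, (i + 1) / 10
    print(f"[{low:.1f}, {high:.1f}): {'#' * count}")
```

All ten suspect problems fall below 0.2 accuracy, so only the first two bins are populated in this example.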

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.