gpqa_cot: by examples

Not solved by any model

There are 0 examples that no model solved. When such examples exist, solving some of them can be a good signal that your model is indeed better than the leading models, provided the problems themselves are sound.

Problems solved by 1 model only

example_link model min_pass1_of_model

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (i.e., better models tend to do worse on them).

example_link pass1_of_ex tau
201 0.150 -0.346
148 0.103 -0.315
336 0.144 -0.292
432 0.079 -0.277
441 0.135 -0.260
88 0.122 -0.247
214 0.043 -0.242
230 0.080 -0.237
170 0.037 -0.234
130 0.140 -0.233
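
A rough sketch of how a table like this could be produced, assuming that tau is Kendall's tau-b between each model's pass@1 on a problem and that model's overall accuracy, and that pass1_of_ex is the problem's mean pass@1 across models. The page does not state either definition, so both are assumptions, and the names `kendall_tau_b`, `suspect_problems`, and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences (ties allowed)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both axes: contributes to no term
        elif dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else 0.0


def suspect_problems(pass1, k=10):
    """Return the k problems with the lowest tau as (problem, mean pass@1, tau).

    pass1[model][problem] is the pass@1 of that model on that problem
    (assumed layout, not taken from the page).
    """
    models = list(pass1)
    problems = list(next(iter(pass1.values())))
    # Overall score of a model: its mean pass@1 over all problems (assumption).
    overall = [sum(pass1[m].values()) / len(problems) for m in models]
    rows = []
    for p in problems:
        per_model = [pass1[m][p] for m in models]
        mean_pass1 = sum(per_model) / len(models)
        rows.append((p, mean_pass1, kendall_tau_b(per_model, overall)))
    return sorted(rows, key=lambda row: row[2])[:k]


# Toy data: problem "c" gets harder as models get stronger overall.
toy = {
    "weak": {"a": 0.2, "b": 0.3, "c": 0.9},
    "mid": {"a": 0.5, "b": 0.6, "c": 0.5},
    "strong": {"a": 0.9, "b": 0.9, "c": 0.1},
}
for problem, mean_pass1, tau in suspect_problems(toy, k=1):
    print(problem, round(mean_pass1, 3), round(tau, 3))
```

With many models, `scipy.stats.kendalltau` would be a faster drop-in for the hand-written pair loop; the pure-Python version is used here only to keep the sketch dependency-free.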

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
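
As the plot itself is not reproduced here, the following is a minimal sketch of the binning behind such a histogram. The bin count of 10 and the function name `accuracy_histogram` are assumptions; the sample values are the pass1_of_ex column of the suspect-problems table, used only as convenient input.

```python
def accuracy_histogram(accuracies, n_bins=10):
    """Count problems per equal-width accuracy bin over [0, 1]."""
    counts = [0] * n_bins
    for acc in accuracies:
        # An accuracy of exactly 1.0 goes into the last bin.
        idx = min(int(acc * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# pass1_of_ex values of the ten suspect problems listed above.
suspect_pass1 = [0.150, 0.103, 0.144, 0.079, 0.135,
                 0.122, 0.043, 0.080, 0.037, 0.140]
hist = accuracy_histogram(suspect_pass1)
for i, count in enumerate(hist):
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f}): {count}")
```

All ten suspect problems fall in the two lowest bins, which matches their low per-problem accuracies.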

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.
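
One way to read this difficulty measure is sketched below, under the assumption that every model has a single scalar win rate and that a problem's difficulty is the lowest win rate among the models that solve it. The page does not define "win rate" or "solve", so the inputs `solved_by` and `win_rate`, the function name, and the toy values are all hypothetical.

```python
def min_win_rate_to_solve(solved_by, win_rate):
    """Map each problem to the lowest win rate among the models solving it.

    solved_by: problem -> list of models that solve it (assumed input).
    win_rate:  model -> scalar win rate in [0, 1] (assumed input).
    Problems that no model solves map to None.
    """
    difficulty = {}
    for problem, models in solved_by.items():
        rates = [win_rate[m] for m in models]
        difficulty[problem] = min(rates) if rates else None
    return difficulty


# Toy data: p1 is solved even by the weak model, p3 by no model at all.
toy_win_rate = {"weak": 0.3, "strong": 0.8}
toy_solved_by = {"p1": ["weak", "strong"], "p2": ["strong"], "p3": []}
difficulty = min_win_rate_to_solve(toy_solved_by, toy_win_rate)
print(difficulty)
```

Under this reading, a higher value means only stronger models solve the problem, and the `None` entries correspond to the "Not solved by any model" list.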