gpqa_cot: by examples


Not solved by any model

There are 0 examples that no model solved. When such examples exist and the problems themselves are sound, solving some of them is a good signal that your model is indeed better than the leading models.

Problems solved by 1 model only

example_link	model	min_pass1_of_model
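As a rough illustration of how the two lists above could be derived, the sketch below groups models by the examples they solve. The `pass1` dictionary, the model names, and the rule "solved means pass@1 above a threshold of 0" are all assumptions made for this example; the page does not state how "solved" is defined.

```python
# Hypothetical data: pass1[model][example_id] is the fraction of sampled
# answers that were correct. Names and values are made up for illustration.
pass1 = {
    "model_a": {0: 1.0, 1: 0.0, 2: 0.0},
    "model_b": {0: 0.5, 1: 0.25, 2: 0.0},
    "model_c": {0: 0.0, 1: 0.0, 2: 0.0},
}


def solvers_per_example(pass1, threshold=0.0):
    """Map each example id to the models whose pass@1 exceeds `threshold`.

    Treating pass@1 > threshold as "solved" is an assumption of this sketch.
    """
    solvers = {}
    for model, scores in pass1.items():
        for example_id, p in scores.items():
            solvers.setdefault(example_id, [])
            if p > threshold:
                solvers[example_id].append(model)
    return solvers


solvers = solvers_per_example(pass1)

# Examples that no model solves.
unsolved = sorted(ex for ex, models in solvers.items() if not models)

# Examples solved by exactly one model, mapped to that model.
solved_by_one = {ex: models[0] for ex, models in solvers.items() if len(models) == 1}

print(unsolved)
print(solved_by_one)
```

With real data, an empty `unsolved` list would correspond to the "0 examples" count reported above.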

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (i.e., better models tend to do worse on them).

example_link	pass1_of_ex	tau
88	0.122	-0.421
25	0.173	-0.397
57	0.035	-0.394
245	0.068	-0.387
250	0.039	-0.380
170	0.013	-0.318
181	0.052	-0.310
121	0.018	-0.307
226	0.096	-0.307
202	0.093	-0.295
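
The `tau` column is not defined on this page. One plausible reading, assumed here, is a Kendall rank correlation between each model's overall score and its pass@1 on the given example, computed across models. The sketch below implements Kendall's tau-b in plain Python under that assumption; the function name and the sample numbers are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences.

    Pairs tied in both sequences are ignored; pairs tied in only one
    sequence enter the denominator, as in the standard tau-b definition.
    """
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt(
        (concordant + discordant + tied_x_only)
        * (concordant + discordant + tied_y_only)
    )
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical inputs: overall accuracy of four models, and their pass@1
# on one example where stronger models do worse.
overall_accuracy = [0.30, 0.40, 0.50, 0.60]
pass1_on_example = [0.9, 0.6, 0.3, 0.1]

tau = kendall_tau_b(overall_accuracy, pass1_on_example)
print(tau)  # -1.0: the example ranks models in exactly the reverse order
```

Under this reading, a strongly negative tau flags an example worth inspecting for a wrong answer key or an ambiguous question.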

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
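
A minimal sketch of how such a histogram could be tabulated, assuming equal-width bins over [0, 1]. The bin count and function name are choices made for this example. The input is only the `pass1_of_ex` column of the ten suspect problems listed above, not the full problem set, so the resulting counts do not reproduce the page's histogram.

```python
def accuracy_histogram(accuracies, n_bins=10):
    """Count problems per equal-width accuracy bin over [0, 1]."""
    counts = [0] * n_bins
    for p in accuracies:
        # Clamp so that an accuracy of exactly 1.0 falls in the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# pass1_of_ex values of the ten suspect problems from the table above.
suspect_pass1 = [0.122, 0.173, 0.035, 0.068, 0.039,
                 0.013, 0.052, 0.018, 0.096, 0.093]

counts = accuracy_histogram(suspect_pass1)
for i, count in enumerate(counts):
    low, high = i / 10, (i + 1) / 10
    print(f"[{low:.1f}, {high:.1f}): {'#' * count}")
```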

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.