gpqa_cot: by examples


Not solved by any model

No examples went unsolved by every model. When such examples exist, solving some of them can be a good signal that your model is indeed better than leading models, provided the problems themselves are sound.

Problems solved by 1 model only

example_link	model	min_pass1_of_model
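As a rough illustration of how the two lists above could be derived, the sketch below groups examples by how many models solve them. The input layout (`pass1` as a nested dict), the function name `split_by_solvers`, and the rule that "solved" means pass@1 greater than zero are all assumptions made for this example; the report does not state its exact criterion.

```python
def split_by_solvers(pass1):
    """Group examples by how many models solve them.

    pass1: dict mapping model name -> dict mapping example id -> pass@1 in [0, 1].
    Assumption: a model "solves" an example when its pass@1 is above zero.
    """
    solvers = {}
    for model, per_example in pass1.items():
        for example_id, score in per_example.items():
            solvers.setdefault(example_id, [])
            if score > 0:
                solvers[example_id].append(model)

    # Examples that no model solves.
    unsolved = sorted(ex for ex, models in solvers.items() if not models)
    # Examples that exactly one model solves, mapped to that model.
    single = {ex: models[0] for ex, models in solvers.items() if len(models) == 1}
    return unsolved, single


# Hypothetical data, for illustration only.
demo = {
    "model_a": {1: 0.0, 2: 1.0, 3: 0.5},
    "model_b": {1: 0.0, 2: 0.0, 3: 0.2},
}
unsolved, single = split_by_solvers(demo)
```

Here example 1 lands in the unsolved list and example 2 is attributed to `model_a` alone.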

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).

example_link	pass1_of_ex	tau
201	0.150	-0.346
148	0.103	-0.315
336	0.144	-0.292
432	0.079	-0.277
441	0.135	-0.260
88	0.122	-0.247
214	0.043	-0.242
230	0.080	-0.237
170	0.037	-0.234
130	0.140	-0.233
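A negative tau in the table above means that models with higher overall scores tend to score lower on that example. The sketch below shows one way such a value could be computed, assuming the `tau` column is Kendall's tau (the tau-b variant, which handles ties) between each model's overall score and its result on the example. That reading of the column, along with the function names and input layout, is an assumption rather than something the report confirms.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both sequences: contributes nothing
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else float("nan")


def lowest_tau_examples(model_scores, example_results, k=10):
    """Return the k examples whose results correlate least with overall score.

    model_scores: overall score per model, in a fixed model order.
    example_results: dict example id -> per-model results in the same order.
    """
    taus = {ex: kendall_tau_b(model_scores, res) for ex, res in example_results.items()}
    defined = [(ex, t) for ex, t in taus.items() if t == t]  # drop NaN values
    return sorted(defined, key=lambda item: item[1])[:k]


# Hypothetical data: the two weakest models solve the example, the two strongest do not.
tau = kendall_tau_b([0.2, 0.4, 0.6, 0.8], [1, 1, 0, 0])
```

For the hypothetical data above, four model pairs are discordant and two are tied on the example result, giving a tau of about -0.816.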

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
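A minimal sketch of the binning behind such a histogram follows, assuming per-problem accuracy is a value in [0, 1] and that ten equal-width bins are used. The bin count and the function name `accuracy_histogram` are illustrative choices, not taken from the report.

```python
def accuracy_histogram(accuracies, num_bins=10):
    """Count problems per equal-width accuracy bin over [0, 1]."""
    counts = [0] * num_bins
    for acc in accuracies:
        # An accuracy of exactly 1.0 goes into the last bin.
        index = min(int(acc * num_bins), num_bins - 1)
        counts[index] += 1
    return counts


# pass1_of_ex values of the ten suspect problems listed earlier on this page.
suspect_pass1 = [0.150, 0.103, 0.144, 0.079, 0.135, 0.122, 0.043, 0.080, 0.037, 0.140]
counts = accuracy_histogram(suspect_pass1)
```

Applied to the ten suspect problems only, every value falls in the two lowest bins: four problems below 0.1 and six between 0.1 and 0.2.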

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.