aime2025_cot: by examples


Not solved by any model

There are 15 examples that no model solved. If these problems are sound, solving some of them is a good signal that your model is indeed better than the leading models.
1, 4, 6, 9, 10, 11, 13, 14, 17, 21, 22, 23, 24, 25, 27
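
A list like the one above can be derived from per-model results. The following is a minimal sketch, assuming the results are available as a mapping from model name to per-example pass/fail flags. The `results` layout, the model names, and the values are hypothetical toy data, not taken from this evaluation.

```python
# Hypothetical toy data: results[model][example_id] is True if the model
# solved that example. Names and values are illustrative only.
results = {
    "model_a": {1: False, 2: True, 3: False},
    "model_b": {1: False, 2: True, 3: True},
    "model_c": {1: False, 2: False, 3: False},
}


def unsolved_examples(results):
    """Return the sorted IDs of examples that no model solved."""
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)
    return sorted(
        ex
        for ex in example_ids
        if not any(per_example.get(ex, False) for per_example in results.values())
    )


print(unsolved_examples(results))
```

Sorting the IDs as integers keeps the list in numeric order rather than string order.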

Problems solved by 1 model only

example_link  model                                 min_pass1_of_model
26            deepseek_r1_distill_llama_70b         0.267
28            google_codegemma_1.1_7b_it            0.013
12            mistralai_mixtral_8x7b_instruct_v0.1  0.011
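
Identifying problems with a single solver can be sketched as follows, under the same kind of assumption: a hypothetical `results` mapping from model name to per-example pass/fail flags, filled with toy values. The `min_pass1_of_model` column is not defined on this page, so the sketch does not attempt to reproduce it.

```python
# Hypothetical toy data: results[model][example_id] is True if the model
# solved that example. Names and values are illustrative only.
results = {
    "model_a": {1: False, 2: True, 3: False},
    "model_b": {1: False, 2: True, 3: True},
    "model_c": {1: False, 2: False, 3: False},
}


def solved_by_one_model(results):
    """Map each example solved by exactly one model to that model's name."""
    solvers = {}
    for model, per_example in results.items():
        for ex, solved in per_example.items():
            if solved:
                solvers.setdefault(ex, []).append(model)
    return {
        ex: models[0]
        for ex, models in sorted(solvers.items())
        if len(models) == 1
    }


print(solved_by_one_model(results))
```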

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e. problems on which better models are least likely to do better and may even do worse.

example_link  pass1_of_ex  tau
12            0.006        0.017
28            0.008        0.030
19            0.019        0.099
29            0.029        0.154
26            0.010        0.221
8             0.068        0.380
20            0.122        0.485
18            0.101        0.505
7             0.130        0.540
15            0.122        0.545
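
The page does not say how tau is computed. One plausible reading is Kendall's tau between each model's overall score and its result on the problem in question. The sketch below implements the tau-b variant (which handles ties) in plain Python under that assumption. The function name and the toy inputs are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (handles ties)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from every term
        if dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical toy inputs: one entry per model.
overall_accuracy = [0.1, 0.2, 0.3, 0.4]    # each model's overall score
pass_rate_on_example = [0.0, 0.1, 0.2, 0.3]  # each model's pass rate on one problem

print(kendall_tau_b(overall_accuracy, pass_rate_on_example))
```

A tau near 1 means stronger models reliably do better on the problem, a tau near 0 means the problem says little about model quality, and a negative tau means stronger models tend to do worse.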

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
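
The binning behind such a histogram can be sketched in a few lines. This is only an illustration: the number of bins, the function name, and the accuracy values are hypothetical, not the ones used for the plot.

```python
def histogram(values, num_bins=10):
    """Count values in [0, 1] into equal-width bins; 1.0 goes in the last bin."""
    counts = [0] * num_bins
    for v in values:
        index = min(int(v * num_bins), num_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical per-problem accuracies (fraction of attempts that were correct).
accuracies = [0.0, 0.05, 0.12, 0.5, 1.0]

print(histogram(accuracies, num_bins=5))
```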

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.
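
The page does not define "minimum win rate to solve". One possible reading is the lowest win rate among the models that solved a given problem, so that a problem solved only by strong models counts as difficult. The sketch below follows that assumed definition. The `results` and `win_rate` mappings, the model names, and all values are hypothetical.

```python
# Hypothetical toy data. results[model][example_id] is True if the model
# solved that example; win_rate[model] is the model's overall win rate.
results = {
    "model_a": {1: False, 2: True, 3: False},
    "model_b": {1: False, 2: True, 3: True},
    "model_c": {1: False, 2: False, 3: False},
}
win_rate = {"model_a": 0.3, "model_b": 0.7, "model_c": 0.1}


def min_win_rate_to_solve(results, win_rate):
    """Map each solved example to the lowest win rate among its solvers."""
    difficulty = {}
    for model, per_example in results.items():
        for ex, solved in per_example.items():
            if not solved:
                continue
            if ex not in difficulty or win_rate[model] < difficulty[ex]:
                difficulty[ex] = win_rate[model]
    return difficulty


print(min_win_rate_to_solve(results, win_rate))
```

Examples that no model solved are absent from the result, since no win rate was sufficient to solve them in this toy data.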