aime2025_cot: by examples

Not solved by any model

There are 8 examples that no model solved. If these are well-posed problems, solving some of them is a good signal that your model is genuinely better than the leading models.
1, 6, 10, 12, 14, 17, 25, 27
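
As a rough illustration of how such a list could be derived, the sketch below groups examples by which models solved them. The `results` layout, the model names, and the boolean "solved" flag are all assumptions made for illustration; this page does not state how "solved" is defined or how the data is stored.

```python
def solvers_by_example(results):
    """Map each example id to the sorted list of models that solved it.

    `results` is a hypothetical layout: model name -> {example id -> solved?}.
    """
    solvers = {}
    for model, per_example in results.items():
        for example_id, solved in per_example.items():
            # Register the example even if this model failed it,
            # so unsolved examples end up with an empty list.
            solvers.setdefault(example_id, [])
            if solved:
                solvers[example_id].append(model)
    return {ex: sorted(models) for ex, models in solvers.items()}


# Made-up data, not taken from the evaluation.
results = {
    "model_a": {1: False, 2: True, 3: False},
    "model_b": {1: False, 2: True, 3: True},
}

solvers = solvers_by_example(results)
unsolved = sorted(ex for ex, models in solvers.items() if not models)
solved_by_one = {ex: models[0] for ex, models in solvers.items() if len(models) == 1}
```

The same mapping yields both the unsolved examples (empty solver list) and the examples solved by exactly one model (solver list of length one).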

Problems solved by 1 model only

example_link	model	min_pass1_of_model
28	deepseek_r1_distill_llama_70b	0.258
11	qwen3-14b	0.181
9	qwen3-4b	0.164
29	qwen2.5-coder-14b-instruct	0.078
22	qwen2-72b-instruct	0.036
13	qwen1.5-72b-chat	0.003

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e. problems on which better models do not reliably do better and, where tau is negative, tend to do worse.

example_link	pass1_of_ex	tau
13	0.002	-0.049
22	0.002	0.061
29	0.002	0.069
9	0.002	0.117
11	0.002	0.142
28	0.002	0.206
23	0.021	0.279
24	0.005	0.285
19	0.018	0.287
21	0.012	0.288
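
The page does not say which statistic `tau` is. Assuming it is a Kendall rank correlation between each model's overall score and its pass@1 on a single problem, a minimal pure-Python sketch of the tau-b variant (which corrects for ties) might look like this. The function name and the sample numbers are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences, O(n^2) pairs."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both; contributes to neither factor
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    # Return 0.0 when one sequence is constant and tau is undefined.
    return (concordant - discordant) / denom if denom else 0.0


# Made-up numbers: overall score per model, and pass@1 on one problem.
overall_score = [0.10, 0.25, 0.40, 0.60]
problem_pass1 = [0.30, 0.20, 0.10, 0.00]

tau = kendall_tau_b(overall_score, problem_pass1)  # -1.0: better models do worse
```

Under this reading, a problem that almost no model solves has nearly constant pass@1 across models, so its tau is close to zero regardless of problem quality. That would explain why the problems solved by only one model also appear in this table.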

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
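
A histogram like this can be built by binning per-problem accuracies into equal-width intervals. The sketch below assumes accuracies lie in [0, 1] and uses 10 bins; both the bin count and the sample values are illustrative choices, not taken from the page.

```python
def accuracy_histogram(accuracies, n_bins=10):
    """Count problems per equal-width accuracy bin over [0, 1]."""
    counts = [0] * n_bins
    for acc in accuracies:
        if not 0.0 <= acc <= 1.0:
            raise ValueError(f"accuracy out of range: {acc}")
        # An accuracy of exactly 1.0 goes into the last bin.
        idx = min(int(acc * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# Made-up per-problem accuracies.
sample = [0.0, 0.002, 0.05, 0.55, 1.0]
hist = accuracy_histogram(sample)
```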

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.