cruxeval_output_cot: by examples



Not solved by any model

There are 20 examples that no model solved. If these problems are well-posed, solving some of them is a good signal that your model is genuinely better than the leading models.
112, 113, 129, 163, 177, 218, 220, 229, 250, 254, 272, 301, 33, 438, 484, 543, 568, 581, 591, 671
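
A list like the one above can be derived from a per-model, per-example score table. The following is a minimal sketch, assuming a hypothetical `results` mapping of model name to example ID to pass@1, and assuming that "not solved" means a pass@1 of exactly 0; the page does not state the exact criterion it uses.

```python
def unsolved_examples(results):
    """Return the IDs of examples on which every model scores 0.

    results: dict mapping model name -> dict mapping example ID -> pass@1.
    This layout is an assumption made for illustration.
    """
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)
    # An example missing from a model's results is treated as unsolved by it.
    return sorted(
        ex for ex in example_ids
        if all(per_example.get(ex, 0.0) == 0.0 for per_example in results.values())
    )


# Toy data with hypothetical model names.
results = {
    "model_a": {"112": 0.0, "149": 1.0, "33": 0.0},
    "model_b": {"112": 0.0, "149": 0.0, "33": 0.0},
}
print(unsolved_examples(results))
```

Because the IDs are sorted as strings, "33" lands after the three-digit IDs that start with 1, which matches the ordering of the list above.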

Problems solved by 1 model only

example_link model min_pass1_of_model
149 qwen2.5-coder-32b-instruct 0.824
44 qwen2.5-coder-32b-instruct 0.824
514 qwen2.5-coder-32b-instruct 0.824
499 qwen3-14b 0.787
340 qwen3-14b 0.787
155 qwen2.5-coder-14b-instruct 0.765
280 qwen2-72b-instruct 0.641
444 qwen2-72b-instruct 0.641
698 qwen1.5-0.5b-chat 0.062
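
The table above could be produced along the following lines. This is a sketch under assumptions: `results` and `overall_pass1` are hypothetical inputs, an example counts as solved when its pass@1 is above 0, and `min_pass1_of_model` is taken to be the overall pass@1 of the single model that solves the example.

```python
def solved_by_one_model(results, overall_pass1):
    """Return (example_id, model, overall_pass1) rows for examples
    that exactly one model solves, strongest solver first.

    results: dict model -> dict example ID -> pass@1 (assumed layout).
    overall_pass1: dict model -> overall pass@1 across the benchmark.
    """
    solvers = {}
    for model, per_example in results.items():
        for ex, score in per_example.items():
            if score > 0:
                solvers.setdefault(ex, []).append(model)

    rows = [
        (ex, models[0], overall_pass1[models[0]])
        for ex, models in solvers.items()
        if len(models) == 1
    ]
    # Sort by descending overall score, then by example ID for stability.
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows


# Toy data with hypothetical model names and scores.
results = {
    "model_a": {"149": 1.0, "280": 0.0, "7": 1.0},
    "model_b": {"149": 0.0, "280": 1.0, "7": 1.0},
}
overall_pass1 = {"model_a": 0.8, "model_b": 0.6}
for row in solved_by_one_model(results, overall_pass1):
    print(*row)
```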

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation. A negative tau means that better models tend to do worse on that problem.

example_link pass1_of_ex tau
376 0.116 -0.193
571 0.003 -0.181
698 0.004 -0.181
779 0.026 -0.111
548 0.499 -0.074
140 0.383 -0.043
88 0.373 -0.040
638 0.163 -0.016
245 0.034 -0.014
556 0.005 0.014
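
The tau column can be illustrated with a rank correlation between each model's overall score and its score on a single problem. The sketch below assumes tau is Kendall's tau-b (the variant that corrects for ties); the page does not say which variant it uses, and the function name and inputs are hypothetical.

```python
import math


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pairs)."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: contributes to neither factor
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else 0.0


# Toy data: four models ordered by overall score; only the two weakest
# models solve this problem, so the correlation is negative.
overall = [0.2, 0.4, 0.6, 0.8]
on_problem = [1, 1, 0, 0]
print(round(kendall_tau_b(overall, on_problem), 3))
```

A problem that only weak models solve, as in the toy data, gets a negative tau and would be flagged as suspect.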

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
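
The binning behind such a histogram can be sketched as follows. The bin count and function name are assumptions, and the sample values are only the `pass1_of_ex` figures of the ten suspect problems listed earlier, not the full benchmark.

```python
def accuracy_histogram(accuracies, num_bins=10):
    """Count problems per equal-width accuracy bin over [0, 1]."""
    counts = [0] * num_bins
    for acc in accuracies:
        # An accuracy of exactly 1.0 falls into the last bin.
        idx = min(int(acc * num_bins), num_bins - 1)
        counts[idx] += 1
    return counts


# pass1_of_ex values of the ten suspect problems, for illustration only.
suspect_accuracies = [
    0.116, 0.003, 0.004, 0.026, 0.499,
    0.383, 0.373, 0.163, 0.034, 0.005,
]
counts = accuracy_histogram(suspect_accuracies)
for i, count in enumerate(counts):
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f}): {'#' * count}")
```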

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.