cruxeval_output_cot: by examples



Not solved by any model

No model solved the following 35 examples. If these problems are well-posed, solving some of them is a good signal that your model genuinely outperforms the leading models.
33, 44, 112, 113, 125, 129, 136, 149, 163, 177, 198, 211, 218, 220, 229, 239, 250, 254, 272, 280, 301, 307, 310, 340, 375, 438, 543, 556, 568, 581, 591, 640, 671, 698, 723
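As a rough illustration of how such a list could be derived, the sketch below assumes a hypothetical `results` mapping from model name to per-example pass@1 scores; neither this layout nor the function name comes from the page itself. An example counts as unsolved when every model scores 0 on it.

```python
from typing import Dict, List


def unsolved_examples(results: Dict[str, Dict[int, float]]) -> List[int]:
    """Return the IDs of examples on which every model has pass@1 == 0.

    `results` is a hypothetical layout: model name -> {example ID -> pass@1}.
    A missing entry is treated as a score of 0.
    """
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)
    return sorted(
        ex
        for ex in example_ids
        if all(per_example.get(ex, 0.0) == 0.0 for per_example in results.values())
    )


# Toy data for illustration only, not taken from the real evaluation.
toy_results = {
    "model_a": {1: 1.0, 2: 0.0, 3: 0.0},
    "model_b": {1: 0.0, 2: 0.5, 3: 0.0},
}
print(unsolved_examples(toy_results))
```

On the toy data, only example 3 is failed by both models, so it is the only ID returned.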

Problems solved by 1 model only

example_link  model                        min_pass1_of_model
599           qwen2.5-coder-14b-instruct   0.762
514           qwen2.5-coder-14b-instruct   0.762
5             qwen3-14b                    0.762
622           qwen3-14b                    0.762
749           google_gemma_3_27b_it        0.759
317           google_gemma_3_27b_it        0.759
155           google_gemma_3_12b_it        0.694
148           google_gemma_2_27b_it        0.599
491           google_gemma_3_4b_it         0.549
499           qwen2.5-coder-3b-instruct    0.445
484           qwen1.5-7b-chat              0.248
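A table like the one above could be produced along the following lines. This is a minimal sketch under two assumptions: the hypothetical `results` layout (model name to per-example pass@1), and a reading of `min_pass1_of_model` as the overall pass@1 of the single model that solved the example. The page does not define the column, so treat that reading as a guess.

```python
from typing import Dict, List, Tuple


def solved_by_one_model(
    results: Dict[str, Dict[int, float]],
) -> List[Tuple[int, str, float]]:
    """List (example ID, sole solver, solver's overall pass@1) rows.

    `results` is a hypothetical layout: model name -> {example ID -> pass@1}.
    Overall pass@1 is taken as the mean of a model's per-example scores
    (an assumption, not the page's documented definition).
    """
    overall = {
        model: sum(per_example.values()) / len(per_example)
        for model, per_example in results.items()
        if per_example
    }
    example_ids = {ex for per_example in results.values() for ex in per_example}
    rows = []
    for ex in example_ids:
        solvers = [m for m, per in results.items() if per.get(ex, 0.0) > 0.0]
        if len(solvers) == 1:
            rows.append((ex, solvers[0], overall[solvers[0]]))
    # Strongest sole solver first, then by example ID for a stable order.
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows


# Toy data for illustration only, not taken from the real evaluation.
toy_scores = {
    "model_a": {1: 1.0, 2: 0.0, 3: 0.0},
    "model_b": {1: 0.0, 2: 0.5, 3: 0.0},
}
for example_id, model, pass1 in solved_by_one_model(toy_scores):
    print(example_id, model, round(pass1, 3))
```

Sorting by the solver's overall score puts the rows most likely to reflect real capability first; examples solved only by a weak model sit at the bottom.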

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation. A negative tau means that better models tend to do worse on the problem.

example_link  pass1_of_ex  tau
376           0.136        -0.068
484           0.006        -0.065
88            0.270        -0.061
245           0.025        -0.021
233           0.113         0.050
499           0.005         0.058
444           0.013         0.065
131           0.094         0.075
571           0.034         0.076
114           0.041         0.086
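The page does not say which correlation statistic tau denotes. Assuming it is Kendall's tau-b between each model's overall score and its score on one problem, a per-problem value could be computed roughly as in this self-contained sketch (function and variable names are illustrative):

```python
import math
from itertools import combinations
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b between two equal-length sequences, O(n^2) pairs."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: drops out of tau-b entirely
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Toy data for illustration only: three models, one problem.
overall_scores = [0.2, 0.5, 0.8]   # each model's overall pass@1
problem_scores = [1.0, 0.0, 0.0]   # each model's pass@1 on this problem
tau = kendall_tau_b(overall_scores, problem_scores)
print(tau < 0)  # the weakest model is the only one that solves the problem
```

A problem where only weaker models succeed yields a negative value, which is the pattern the table flags as suspect.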

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
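For readers who want to reproduce such a histogram from per-problem accuracies, here is a minimal binning sketch. The bin count and the function name are assumptions; the page does not state how its plot was built.

```python
from typing import List, Sequence


def accuracy_histogram(accuracies: Sequence[float], bins: int = 10) -> List[int]:
    """Count problems per equal-width accuracy bin over [0, 1].

    An accuracy of exactly 1.0 is placed in the last bin.
    """
    counts = [0] * bins
    for acc in accuracies:
        index = min(int(acc * bins), bins - 1)
        counts[index] += 1
    return counts


# Toy accuracies for illustration only, not taken from the real evaluation.
toy_accuracies = [0.0, 0.05, 0.5, 1.0]
print(accuracy_histogram(toy_accuracies, bins=4))
```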

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.