cruxeval_input_cot: by examples


Not solved by any model

No model solved the following 5 examples. If these problems are sound, solving some of them is a good signal that your model is indeed better than the leading models.
129, 218, 220, 501, 620
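As a rough illustration of how such a list could be derived, the sketch below assumes a hypothetical `results` mapping from model name to per-example pass/fail flags. The page does not describe its data layout, so the structure and the names `results` and `unsolved_examples` are assumptions.

```python
# Hypothetical layout: results[model][example_id] is True when that
# model solved the example. This is an assumption, not the page's format.
def unsolved_examples(results):
    # Collect every example id seen for any model.
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)
    # Keep the ids that no model solved (a missing entry counts as unsolved).
    return sorted(
        ex for ex in example_ids
        if not any(per_example.get(ex, False) for per_example in results.values())
    )


demo = {
    "model_a": {1: True, 2: False, 3: False},
    "model_b": {1: False, 2: False, 3: True},
}
print(unsolved_examples(demo))
```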

Problems solved by 1 model only

example_link  model                       min_pass1_of_model
229           qwen2.5-coder-32b-instruct  0.761
250           qwen2-72b-instruct          0.569
413           qwen2.5-coder-7b-instruct   0.549
444           qwen2.5-coder-3b-instruct   0.433
729           qwen1.5-32b-chat            0.414
112           qwen2-math-1.5b-instruct    0.169
177           qwen2-1.5b-instruct         0.151
185           qwen1.5-0.5b-chat           0.017
128           qwen1.5-0.5b-chat           0.017
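A minimal sketch of how a table like this might be produced, assuming the same hypothetical `results` mapping (model to per-example pass/fail) plus a `pass1` mapping from model to its overall pass@1. Reading `min_pass1_of_model` as the overall score of the single solving model is an interpretation, not something the page states.

```python
# Hypothetical inputs: results[model][example_id] -> bool,
# pass1[model] -> overall pass@1 of that model.
def solved_by_one_model(results, pass1):
    # Map each example to the list of models that solved it.
    solvers = {}
    for model, per_example in results.items():
        for ex, solved in per_example.items():
            if solved:
                solvers.setdefault(ex, []).append(model)
    # Keep examples with exactly one solver, paired with that model's score.
    rows = [
        (ex, models[0], pass1[models[0]])
        for ex, models in solvers.items()
        if len(models) == 1
    ]
    # Strongest solving model first, matching the ordering of the table.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


demo_results = {
    "strong": {1: True, 2: True, 3: False},
    "weak": {1: True, 2: False, 3: True},
}
demo_pass1 = {"strong": 0.7, "weak": 0.2}
for row in solved_by_one_model(demo_results, demo_pass1):
    print(row)
```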

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e. better models tend to do worse on them.

example_link  pass1_of_ex  tau
236           0.006        -0.186
128           0.001        -0.181
185           0.001        -0.181
320           0.003        -0.161
113           0.004        -0.161
177           0.002        -0.142
438           0.086        -0.136
112           0.002        -0.135
687           0.003        -0.123
273           0.132        -0.119
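The page does not say how tau is computed. The sketch below assumes it is a Kendall rank correlation (tau-b, which tolerates ties) between each model's overall score and its outcome on a single problem. The function name and the demo data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences of numbers."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both factors
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    # Pairs untied in x times pairs untied in y, under the square root.
    untied_x = concordant + discordant + tied_y_only
    untied_y = concordant + discordant + tied_x_only
    denom = sqrt(untied_x * untied_y)
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical data: overall scores of four models and whether each one
# solved a given problem (1) or not (0). Only the weaker models solve it.
overall = [0.1, 0.2, 0.3, 0.4]
solved = [1, 1, 0, 0]
print(kendall_tau_b(overall, solved))
```

A negative value, as in this demo, is the pattern that would flag a problem as suspect: weaker models pass it while stronger ones fail.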

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
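For readers who want to reproduce a histogram of this kind, here is a minimal sketch that bins per-problem accuracies into equal-width buckets. The bin count and the sample values are assumptions, not taken from the page.

```python
def accuracy_histogram(accuracies, bins=10):
    """Count problems per equal-width accuracy bin over [0, 1]."""
    counts = [0] * bins
    for acc in accuracies:
        # An accuracy of exactly 1.0 falls into the last bin.
        idx = min(int(acc * bins), bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical per-problem accuracies.
sample = [0.05, 0.15, 0.95, 1.0]
print(accuracy_histogram(sample))
```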

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.
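One possible reading of this difficulty measure, sketched below: for each problem, take the lowest overall score among the models that solve it. The page does not define "win rate", so the `score` mapping here is a hypothetical stand-in for whatever per-model rating is used.

```python
# Hypothetical inputs: results[model][example_id] -> bool,
# score[model] -> overall rating of that model (e.g. a win rate).
def min_score_to_solve(results, score):
    difficulty = {}
    for model, per_example in results.items():
        for ex, solved in per_example.items():
            if solved:
                # Track the weakest model that still solves this example.
                current = difficulty.get(ex, float("inf"))
                difficulty[ex] = min(current, score[model])
    # Examples solved by no model are absent from the result.
    return difficulty


demo_results = {
    "strong": {1: True, 2: True, 3: False, 4: False},
    "weak": {1: True, 2: False, 3: True, 4: False},
}
demo_score = {"strong": 0.7, "weak": 0.2}
print(min_score_to_solve(demo_results, demo_score))
```

Under this reading, a higher value means only stronger models solve the problem, so it can serve as a difficulty estimate.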