cruxeval_output_cot: by examples


Not solved by any model

There are 20 examples that no model solves. If these are good problems, solving some of them is a good signal that your model is indeed better than the leading models.
33, 112, 113, 129, 163, 177, 218, 220, 229, 250, 254, 272, 301, 438, 484, 543, 568, 581, 591, 671
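
A list like the one above can be derived from per-model results with a set difference. The following is a minimal sketch, not the page's actual pipeline: the `solved_by_model` mapping, the model names, and the example ids are all hypothetical.

```python
# Hypothetical per-model results: model name -> set of example ids it solved.
solved_by_model = {
    "model_a": {1, 2, 3},
    "model_b": {2, 3, 4},
    "model_c": {3},
}
# Hypothetical full set of example ids in the benchmark.
all_examples = {1, 2, 3, 4, 5, 6}


def unsolved_examples(solved_by_model, all_examples):
    """Return the sorted ids of examples that no model solved."""
    solved_by_any = set().union(*solved_by_model.values())
    return sorted(all_examples - solved_by_any)


unsolved = unsolved_examples(solved_by_model, all_examples)
print(unsolved)
```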

Problems solved by 1 model only

example_link	model	min_pass1_of_model
149	qwen2.5-coder-32b-instruct	0.824
44	qwen2.5-coder-32b-instruct	0.824
514	qwen2.5-coder-32b-instruct	0.824
499	qwen3-14b	0.787
340	qwen3-14b	0.787
155	qwen2.5-coder-14b-instruct	0.765
280	qwen2-72b-instruct	0.641
444	qwen2-72b-instruct	0.641
698	qwen1.5-0.5b-chat	0.062
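
A table of this shape can be sketched by counting how many models solve each example and keeping those with exactly one solver. This is an illustration under assumptions: the data below is hypothetical, and reading the third column as the overall pass@1 of the single solving model is an interpretation, not something the page states.

```python
# Hypothetical per-model results: model name -> set of example ids it solved.
solved_by_model = {
    "model_a": {1, 2, 3},
    "model_b": {2, 3, 4},
    "model_c": {3},
}
# Hypothetical overall pass@1 of each model on the whole benchmark.
overall_pass1 = {"model_a": 0.8, "model_b": 0.6, "model_c": 0.1}


def solved_by_one_model(solved_by_model, overall_pass1):
    """Return (example, model, model_pass1) rows for examples with one solver."""
    solvers = {}
    for model, solved in solved_by_model.items():
        for example in solved:
            solvers.setdefault(example, []).append(model)
    rows = [
        (example, models[0], overall_pass1[models[0]])
        for example, models in solvers.items()
        if len(models) == 1
    ]
    # Strongest solving model first; ties broken by example id.
    return sorted(rows, key=lambda row: (-row[2], row[0]))


rows = solved_by_one_model(solved_by_model, overall_pass1)
for example, model, pass1 in rows:
    print(f"{example}\t{model}\t{pass1:.3f}")
```

A row whose only solver is a weak model, such as example 698 above, may deserve a closer look, since a lucky guess is one possible explanation.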

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation, as measured by tau. A negative tau means that better models tend to do worse on the problem.

example_link	pass1_of_ex	tau
376	0.116	-0.193
571	0.003	-0.181
698	0.004	-0.181
779	0.026	-0.111
548	0.499	-0.074
140	0.383	-0.043
88	0.373	-0.040
638	0.163	-0.016
245	0.034	-0.014
556	0.005	0.014
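
The page does not say which tau variant it uses. The sketch below assumes Kendall's tau-b, computed for a single problem between each model's overall score and whether that model solved the problem. The function name and the data are hypothetical.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: drops out of both denominator terms
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical data: overall score per model, and 1/0 for whether each
# model solved one particular problem. Weaker models solve it here.
overall_score = [0.2, 0.4, 0.6, 0.8]
solved_problem = [1, 1, 0, 0]

tau = kendall_tau_b(overall_score, solved_problem)
print(round(tau, 3))
```

With binary outcomes most model pairs are tied on the problem, which is why a tie-aware variant is a natural choice here; a different variant would give different magnitudes.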

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
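
As a rough sketch of how such a histogram could be binned, the code below counts problems per accuracy bin of width 0.1. For illustration it uses only the ten `pass1_of_ex` values from the suspect-problems table, not the full benchmark; the bin width and function name are assumptions.

```python
# pass1_of_ex values of the ten suspect problems listed on this page.
accuracies = [0.116, 0.003, 0.004, 0.026, 0.499, 0.383, 0.373, 0.163, 0.034, 0.005]


def accuracy_histogram(accuracies, num_bins=10):
    """Count problems per equal-width accuracy bin over [0, 1]."""
    counts = [0] * num_bins
    for acc in accuracies:
        # An accuracy of exactly 1.0 goes in the last bin.
        index = min(int(acc * num_bins), num_bins - 1)
        counts[index] += 1
    return counts


counts = accuracy_histogram(accuracies)
for index, count in enumerate(counts):
    low = index / 10
    print(f"{low:.1f}-{low + 0.1:.1f}\t{'#' * count}")
```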

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.