cruxeval_output_cot: by examples

Not solved by any model

There are 35 examples that no model solved. If these are well-posed problems, solving some of them is a good signal that your model is better than the leading models.
33, 44, 112, 113, 125, 129, 136, 149, 163, 177, 198, 211, 218, 220, 229, 239, 250, 254, 272, 280, 301, 307, 310, 340, 375, 438, 543, 556, 568, 581, 591, 640, 671, 698, 723

Problems solved by 1 model only

example_link	model	min_pass1_of_model
599	qwen2.5-coder-14b-instruct	0.762
514	qwen2.5-coder-14b-instruct	0.762
5	qwen3-14b	0.762
622	qwen3-14b	0.762
749	google_gemma_3_27b_it	0.759
317	google_gemma_3_27b_it	0.759
155	google_gemma_3_12b_it	0.694
148	google_gemma_2_27b_it	0.599
491	google_gemma_3_4b_it	0.549
499	qwen2.5-coder-3b-instruct	0.445
484	qwen1.5-7b-chat	0.248
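
Lists like the unsolved examples and the single-solver table can be derived from a per-model, per-example record of successes. The sketch below is only an illustration: the `solved` mapping, the function name, and the rule for what counts as "solved" are assumptions, not the page's actual pipeline.

```python
# Hypothetical input: solved[model][example_id] is True when that model
# is counted as having solved the example (the exact criterion is assumed).
def group_by_solver_count(solved):
    solvers = {}
    for model, per_example in solved.items():
        for example_id, ok in per_example.items():
            solvers.setdefault(example_id, [])
            if ok:
                solvers[example_id].append(model)
    # Examples with no solver at all.
    unsolved = sorted(ex for ex, models in solvers.items() if not models)
    # Examples with exactly one solver, mapped to that model.
    single = {ex: models[0] for ex, models in sorted(solvers.items())
              if len(models) == 1}
    return unsolved, single


# Toy data, not taken from the leaderboard.
solved = {
    "model_a": {1: True, 2: False, 3: False},
    "model_b": {1: True, 2: True, 3: False},
}
unsolved, single = group_by_solver_count(solved)
print(unsolved)  # [3]
print(single)    # {2: 'model_b'}
```

For a problem with a single solver, the min_pass1_of_model column is simply that model's overall pass@1.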

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e. better models do not reliably do better on these, and where tau is negative they tend to do worse.

example_link	pass1_of_ex	tau
376	0.136	-0.068
484	0.006	-0.065
88	0.270	-0.061
245	0.025	-0.021
233	0.113	0.050
499	0.005	0.058
444	0.013	0.065
131	0.094	0.075
571	0.034	0.076
114	0.041	0.086
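
As a rough sketch of how such a per-problem correlation could be computed, the code below assumes that tau is Kendall's tau-b between each model's overall pass@1 and its pass rate on one example. The page does not state which statistic or inputs it uses, so both the formula and the names here are assumptions.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b (tie-corrected) computed over all pairs."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    # Return 0.0 when one variable is constant and tau is undefined.
    return (concordant - discordant) / denom if denom else 0.0


# Toy numbers, not taken from the leaderboard: one entry per model.
overall_pass1 = [0.2, 0.4, 0.6, 0.8]   # model strength
example_pass1 = [1.0, 0.5, 0.5, 0.0]   # pass rate on one example
tau = kendall_tau_b(overall_pass1, example_pass1)
print(round(tau, 3))  # -0.913
```

A strongly negative value like this one means stronger models do worse on the example, which is the pattern that would flag it as suspect.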

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
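
A minimal sketch of the binning behind such a histogram, assuming ten equal-width bins over [0, 1] (the bin count is an assumption). As input it reuses the pass1_of_ex values of the ten suspect problems listed above, so the result illustrates the method only and is not the full distribution.

```python
def accuracy_histogram(accuracies, n_bins=10):
    """Count values per equal-width bin on [0, 1]; 1.0 goes in the last bin."""
    counts = [0] * n_bins
    for acc in accuracies:
        index = min(int(acc * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# pass1_of_ex values from the "Suspect problems" table.
suspect_pass1 = [0.136, 0.006, 0.270, 0.025, 0.113,
                 0.005, 0.013, 0.094, 0.034, 0.041]
counts = accuracy_histogram(suspect_pass1)
print(counts)  # [7, 2, 1, 0, 0, 0, 0, 0, 0, 0]
```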

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.
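
One way to read "minimum win rate to solve" is the lowest overall pass@1 among the models that solve a problem, which would match the min_pass1_of_model column above. The sketch below follows that reading; the interpretation, the input mappings, and the function name are all assumptions.

```python
# Hypothetical inputs:
#   solved[model][example_id] -> True if the model solved the example
#   overall_pass1[model]      -> the model's overall pass@1
def min_pass1_to_solve(solved, overall_pass1):
    difficulty = {}
    for model, per_example in solved.items():
        score = overall_pass1[model]
        for example_id, ok in per_example.items():
            if ok:
                # Keep the weakest model that still solves the example.
                difficulty[example_id] = min(
                    score, difficulty.get(example_id, score))
    return difficulty  # unsolved examples are absent


# Toy data, not taken from the leaderboard.
solved = {
    "model_a": {1: True, 2: False, 3: False},
    "model_b": {1: True, 2: True, 3: False},
}
overall_pass1 = {"model_a": 0.3, "model_b": 0.7}
difficulty = min_pass1_to_solve(solved, overall_pass1)
print(difficulty)  # {1: 0.3, 2: 0.7}
```

Under this reading, a high value marks a hard problem, since only strong models solve it.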