cruxeval_input_cot: by examples


Not solved by any model

Five examples were not solved by any model. If these problems are sound, solving some of them is a good signal that your model is better than the leading models.
129, 218, 220, 501, 620
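
As a rough illustration of how such a list could be derived, the sketch below scans per-model results for examples that no model solved. The `results` layout (model name mapped to example id mapped to pass rate) and the toy values are assumptions made for this example, not the page's actual data or code.

```python
def unsolved_examples(results):
    """Return sorted example ids whose pass rate is 0 for every model.

    `results` is a hypothetical layout: {model_name: {example_id: pass_rate}}.
    An example missing from a model's results is treated as unsolved by it.
    """
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)
    return sorted(
        ex
        for ex in example_ids
        if all(per_example.get(ex, 0.0) == 0.0 for per_example in results.values())
    )


# Toy data for illustration only; not taken from the benchmark.
toy_results = {
    "model_a": {1: 1.0, 2: 0.0, 3: 0.0},
    "model_b": {1: 0.0, 2: 0.5, 3: 0.0},
}
print(unsolved_examples(toy_results))
```

On the toy data only example 3 has a zero pass rate for every model, so it is the only id returned.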

Problems solved by 1 model only

example_link	model	min_pass1_of_model
229	qwen2.5-coder-32b-instruct	0.761
250	qwen2-72b-instruct	0.569
413	qwen2.5-coder-7b-instruct	0.549
444	qwen2.5-coder-3b-instruct	0.433
729	qwen1.5-32b-chat	0.414
112	qwen2-math-1.5b-instruct	0.169
177	qwen2-1.5b-instruct	0.151
185	qwen1.5-0.5b-chat	0.017
128	qwen1.5-0.5b-chat	0.017
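
A table of this shape could be produced along the following lines. This is a minimal sketch: it assumes a hypothetical `results` mapping of model name to per-example pass rates, and it assumes a model's overall pass@1 is the mean of its per-example pass rates, which the page does not state.

```python
def solved_by_exactly_one(results):
    """Return (example_id, model, overall_pass1) rows for single-solver examples.

    `results` is a hypothetical layout: {model_name: {example_id: pass_rate}}.
    Overall pass@1 is assumed to be the mean per-example pass rate.
    """
    overall = {m: sum(r.values()) / len(r) for m, r in results.items()}
    example_ids = sorted({ex for r in results.values() for ex in r})
    rows = []
    for ex in example_ids:
        solvers = [m for m, r in results.items() if r.get(ex, 0.0) > 0.0]
        if len(solvers) == 1:
            rows.append((ex, solvers[0], overall[solvers[0]]))
    # Strongest solver first, matching the ordering of the table above.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Toy data for illustration only; not taken from the benchmark.
toy_results = {
    "model_a": {1: 1.0, 2: 0.0, 3: 1.0, 4: 1.0},
    "model_b": {1: 1.0, 2: 1.0, 3: 0.0, 4: 0.0},
}
for row in solved_by_exactly_one(toy_results):
    print(*row, sep="\t")
```

Rows near the bottom of such a table, where a weak model is the only solver, are natural candidates for the suspect-problem check.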

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on them).

example_link	pass1_of_ex	tau
236	0.006	-0.186
128	0.001	-0.181
185	0.001	-0.181
320	0.003	-0.161
113	0.004	-0.161
177	0.002	-0.142
438	0.086	-0.136
112	0.002	-0.135
687	0.003	-0.123
273	0.132	-0.119
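
The page does not say how tau is computed. The sketch below assumes it is Kendall's rank correlation (the tau-b variant, which handles ties) between each model's overall score and that model's pass rate on a single example. The function name and toy numbers are illustrative only.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall tau-b between two equal-length sequences (O(n^2) pair scan)."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from every count
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt(
        (concordant + discordant + ties_x_only)
        * (concordant + discordant + ties_y_only)
    )
    return (concordant - discordant) / denom if denom else 0.0


# Toy data: overall score of three hypothetical models, and their pass
# rate on one example that only the weakest model solves.
overall_scores = [0.1, 0.5, 0.9]
example_pass = [1.0, 0.0, 0.0]
tau = kendall_tau_b(overall_scores, example_pass)
print(f"tau = {tau:.3f}")
```

A negative tau, as in the toy case, means stronger models tend to fail the example while weaker ones pass it, which is the pattern flagged in the table above.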

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
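
A histogram like this can be approximated by binning per-problem accuracies. The sketch below assumes each problem's accuracy is a value in [0, 1] (for instance its mean pass rate across models, which is an assumption here) and uses equal-width bins; the function name and toy values are illustrative.

```python
def accuracy_histogram(accuracies, num_bins=10):
    """Count accuracies in [0, 1] per equal-width bin; 1.0 goes in the last bin."""
    counts = [0] * num_bins
    for acc in accuracies:
        idx = min(int(acc * num_bins), num_bins - 1)
        counts[idx] += 1
    return counts


# Toy per-problem accuracies for illustration only.
toy_accuracies = [0.0, 0.25, 0.5, 0.75, 1.0]
toy_counts = accuracy_histogram(toy_accuracies, num_bins=4)
for i, count in enumerate(toy_counts):
    low, high = i / 4, (i + 1) / 4
    print(f"[{low:.2f}, {high:.2f}) {'#' * count}")
```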

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.
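
The page does not define this difficulty measure precisely. One plausible reading, used in the sketch below purely as an assumption, is that a problem's difficulty is the lowest overall score (for example, win rate) among the models that solved it. The `results` and `model_score` structures are hypothetical.

```python
def min_score_to_solve(results, model_score):
    """Map each solved example to the lowest score among models that solved it.

    `results` is a hypothetical layout: {model_name: {example_id: pass_rate}}.
    `model_score` is a hypothetical {model_name: overall_score} mapping.
    Examples that no model solved are omitted from the result.
    """
    difficulty = {}
    for model, per_example in results.items():
        for ex, pass_rate in per_example.items():
            if pass_rate > 0.0:
                current = difficulty.get(ex, float("inf"))
                difficulty[ex] = min(current, model_score[model])
    return difficulty


# Toy data for illustration only; not taken from the benchmark.
toy_results = {
    "model_a": {1: 1.0, 2: 1.0, 3: 0.0},
    "model_b": {1: 1.0, 2: 0.0, 3: 0.0},
}
toy_scores = {"model_a": 0.8, "model_b": 0.3}
toy_difficulty = min_score_to_solve(toy_results, toy_scores)
print(toy_difficulty)
```

Under this reading, a problem that even the weakest model solves gets a low difficulty, and a problem solved only by the strongest model gets a high one.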