cruxeval_input_cot: by examples


Not solved by any model

No model solved the following 19 examples. If these problems are well-posed, solving some of them is a good signal that your model is genuinely better than the leading models.
53, 113, 128, 129, 177, 185, 200, 218, 220, 295, 314, 322, 423, 474, 501, 545, 556, 620, 729
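
A list like the one above could be derived from a per-model, per-example score table. The following is a minimal sketch, assuming a hypothetical `results` dictionary that maps each model name to its pass@1 score per example id; the model names and scores are made up for illustration and the page's actual data format is not known.

```python
# Hypothetical data: results[model][example_id] = pass@1 on that example.
results = {
    "model_a": {53: 0.0, 113: 0.0, 148: 0.0},
    "model_b": {53: 0.0, 113: 0.0, 148: 0.4},
}


def unsolved_examples(results):
    """Return example ids with a pass@1 of zero for every model, sorted numerically."""
    example_ids = set()
    for scores in results.values():
        example_ids.update(scores)
    return sorted(
        ex
        for ex in example_ids
        # A missing score is treated as unsolved (an assumption of this sketch).
        if all(scores.get(ex, 0.0) == 0.0 for scores in results.values())
    )


print(unsolved_examples(results))  # [53, 113]
```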

Problems solved by 1 model only

example_link	model	min_pass1_of_model
148	qwen2.5-coder-32b-instruct	0.742
229	qwen2.5-coder-32b-instruct	0.742
250	qwen2.5-coder-32b-instruct	0.742
268	qwen2.5-coder-32b-instruct	0.742
320	qwen2.5-coder-32b-instruct	0.742
687	qwen2.5-coder-32b-instruct	0.742
550	qwen2.5-coder-14b-instruct	0.627
413	qwen2.5-coder-14b-instruct	0.627
254	google_gemma_2_27b_it	0.527
770	mistralai_mixtral_8x22b_instruct_v0.1	0.485
257	qwen3-32b	0.313
540	qwen2.5-coder-1.5b-instruct	0.240
581	llama-3.2-3B-instruct	0.212
33	deepseek_v2_lite_chat	0.212
491	qwen3-8b	0.186
236	qwen2.5-coder-0.5b-instruct	0.180
444	qwen2.5-coder-0.5b-instruct	0.180
469	qwen2-1.5b-instruct	0.067
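
The table above could be reproduced along the following lines. This is a sketch under two assumptions: that an example counts as solved when a model's pass@1 on it is above zero, and that the third column is the overall pass@1 of the single model that solved it. The `results` and `model_pass1` dictionaries and the model names are hypothetical.

```python
# Hypothetical data: per-example pass@1 for each model, plus each model's overall pass@1.
results = {
    "model_a": {10: 0.9, 113: 0.0, 148: 0.3, 254: 0.0},
    "model_b": {10: 0.8, 113: 0.0, 148: 0.0, 254: 0.1},
}
model_pass1 = {"model_a": 0.742, "model_b": 0.527}


def solved_by_one_model(results, model_pass1):
    """Return (example_id, model, model_pass1) rows for examples with exactly one solver."""
    example_ids = set().union(*(scores.keys() for scores in results.values()))
    rows = []
    for ex in example_ids:
        solvers = [m for m, scores in results.items() if scores.get(ex, 0.0) > 0.0]
        if len(solvers) == 1:
            rows.append((ex, solvers[0], model_pass1[solvers[0]]))
    # Strongest solving model first, then by example id for a stable order.
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows


for row in solved_by_one_model(results, model_pass1):
    print(*row, sep="\t")
```

Examples solved only by a weak model are worth a second look, since a weak model succeeding where every stronger model fails can indicate a lucky guess or a flawed problem.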

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on them).

example_link	pass1_of_ex	tau
547	0.031	-0.184
469	0.005	-0.135
444	0.004	-0.081
236	0.004	-0.081
491	0.006	-0.073
33	0.006	-0.065
581	0.019	-0.058
179	0.013	-0.047
112	0.017	-0.047
199	0.074	-0.036
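
The page does not say how `tau` is computed. One plausible reading is Kendall's tau between each model's overall score and its score on the example in question. The sketch below implements the tau-b variant (which accounts for ties) in plain Python under that assumption; the scores are invented for illustration.

```python
import math


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences (O(n^2), handles ties)."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both denominator terms
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical scores for four models, ordered from weakest to strongest overall.
overall_pass1 = [0.1, 0.3, 0.5, 0.7]
# Hypothetical pass@1 of the same four models on one suspect example.
example_pass1 = [0.2, 0.1, 0.0, 0.0]

suspect_tau = kendall_tau_b(overall_pass1, example_pass1)
print(suspect_tau < 0)  # True: stronger models do worse on this example
```

A negative value means the example ranks models in roughly the opposite order to the benchmark as a whole, which is why such problems are flagged as suspect.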

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
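
As a rough illustration of how such a histogram could be built, the sketch below buckets per-problem accuracies into equal-width bins over [0, 1]. The bin count and the helper name `accuracy_histogram` are assumptions of this example. The first three accuracies are taken from the suspect-problems table and the last two are hypothetical.

```python
def accuracy_histogram(accuracies, num_bins=10):
    """Count problems per equal-width accuracy bin over [0, 1]."""
    counts = [0] * num_bins
    for acc in accuracies:
        # Clamp so that an accuracy of exactly 1.0 lands in the last bin.
        index = min(int(acc * num_bins), num_bins - 1)
        counts[index] += 1
    return counts


# Three accuracies from the suspect-problems table, plus two hypothetical ones.
accuracies = [0.031, 0.005, 0.074, 0.5, 1.0]
counts = accuracy_histogram(accuracies)
print(counts)  # [3, 0, 0, 0, 0, 1, 0, 0, 0, 1]
```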

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.