leetcode: by examples


Not solved by any model

There are 70 examples not solved by any model. If these are sound problems, solving some of them is a good signal that your model is better than the leading models.
0, 102, 103, 108, 110, 111, 113, 116, 118, 119, 122, 130, 136, 137, 14, 141, 142, 144, 145, 146, 147, 149, 152, 154, 155, 156, 158, 160, 164, 165, 166, 168, 169, 173, 174, 176, 177, 32, 4, 45, 48, 50, 51, 52, 57, 58, 61, 62, 66, 68, 69, 70, 73, 74, 76, 78, 79, 80, 82, 84, 86, 88, 89, 90, 91, 93, 95, 96, 97, 99
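A minimal sketch of how such a list could be derived, assuming per-model results are stored as a mapping from model name to `{example id: pass@1}`. The layout, the function name `unsolved_by_all`, and the toy data are hypothetical, not taken from the benchmark code.

```python
def unsolved_by_all(pass1):
    """Return the example ids whose pass@1 is 0 for every model.

    pass1 maps model name -> {example id -> pass@1 in [0, 1]} (assumed layout).
    A missing entry is treated as unsolved.
    """
    examples = set()
    for per_example in pass1.values():
        examples.update(per_example)
    return sorted(
        ex for ex in examples
        if all(per_example.get(ex, 0.0) == 0.0 for per_example in pass1.values())
    )


# Toy data for illustration only.
toy = {
    "model_a": {"0": 0.0, "1": 0.4, "14": 0.0, "102": 0.0},
    "model_b": {"0": 0.0, "1": 0.0, "14": 0.0, "102": 0.1},
}
print(unsolved_by_all(toy))  # ['0', '14']
```

Keeping the ids as strings sorts them lexicographically, which matches the ordering of the list above (for example, 14 after 137).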

Problems solved by 1 model only

example_link	model	min_pass1_of_model
1	google_gemma_3_27b_it	0.402
100	google_gemma_3_27b_it	0.402
104	google_gemma_3_27b_it	0.402
106	google_gemma_3_27b_it	0.402
12	google_gemma_3_27b_it	0.402
124	google_gemma_3_27b_it	0.402
121	google_gemma_3_27b_it	0.402
163	google_gemma_3_27b_it	0.402
159	google_gemma_3_27b_it	0.402
59	google_gemma_3_27b_it	0.402
63	google_gemma_3_27b_it	0.402
167	google_gemma_3_27b_it	0.402
26	google_gemma_3_27b_it	0.402
55	google_gemma_3_27b_it	0.402
47	google_gemma_3_27b_it	0.402
46	google_gemma_3_27b_it	0.402
43	google_gemma_3_27b_it	0.402
71	google_gemma_3_27b_it	0.402
94	google_gemma_3_27b_it	0.402
126	google_gemma_2_27b_it	0.222
117	google_gemma_2_27b_it	0.222
25	google_gemma_2_9b_it	0.150
162	google_gemma_2_9b_it	0.150
22	google_codegemma_1.1_7b_it	0.050
172	qwen2.5-coder-32b-instruct	0.003
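The table above could be produced along these lines. This is a sketch under two assumptions: the same hypothetical `{model: {example id: pass@1}}` layout, and a reading of `min_pass1_of_model` as the overall pass@1 of the weakest (here, the only) model that solves the problem. The page does not define the column, so treat that reading as a guess.

```python
def solved_by_one_model(pass1):
    """List (example, model, overall pass@1 of that model) for single-solver examples."""
    # Overall pass@1 per model: mean over its examples (assumed definition).
    overall = {m: sum(v.values()) / len(v) for m, v in pass1.items()}

    # Collect, for each example, the models with a non-zero pass@1 on it.
    solvers = {}
    for model, per_example in pass1.items():
        for ex, score in per_example.items():
            if score > 0:
                solvers.setdefault(ex, []).append(model)

    rows = [(ex, ms[0], overall[ms[0]]) for ex, ms in solvers.items() if len(ms) == 1]
    # Strongest sole solver first, as in the table.
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Toy data for illustration only.
toy = {
    "strong": {"a": 1.0, "b": 0.5, "c": 0.0, "d": 0.0},
    "weak": {"a": 1.0, "b": 0.0, "c": 0.0, "d": 0.2},
}
for ex, model, score in solved_by_one_model(toy):
    print(f"{ex}\t{model}\t{score:.3f}")
```

In the toy data, `a` is solved by both models and `c` by neither, so only `b` and `d` are listed.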

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e. problems on which better models tend to do worse.

example_link	pass1_of_ex	tau
172	0.010	0.145
22	0.004	0.200
162	0.005	0.245
25	0.005	0.245
126	0.010	0.267
117	0.010	0.267
94	0.013	0.278
1	0.020	0.278
55	0.020	0.278
159	0.013	0.278
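One way to compute such a ranking is sketched below, assuming `tau` is a Kendall rank correlation between each model's overall pass@1 and its pass@1 on the problem. The page does not say which variant is used; tau-b is chosen here because per-problem scores contain many ties. The helper names and the `{model: {example id: pass@1}}` layout are hypothetical, and `pass1_of_ex` is taken to be the mean pass@1 over models.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall tau-b of two equal-length sequences (O(n^2), fine for few models)."""
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    denom = sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    return (concordant - discordant) / denom if denom else 0.0


def most_suspect(pass1, k=10):
    """Return the k (example, mean pass@1, tau) rows with the lowest tau."""
    models = sorted(pass1)
    overall = [sum(pass1[m].values()) / len(pass1[m]) for m in models]
    examples = sorted(set().union(*(pass1[m] for m in models)))
    rows = []
    for ex in examples:
        per_model = [pass1[m].get(ex, 0.0) for m in models]
        mean_pass1 = sum(per_model) / len(per_model)
        rows.append((ex, mean_pass1, kendall_tau_b(overall, per_model)))
    return sorted(rows, key=lambda row: row[2])[:k]


# Toy data for illustration only: "bad" is solved best by the weakest model.
toy = {
    "weak": {"good": 0.0, "bad": 1.0, "x": 0.0},
    "mid": {"good": 0.5, "bad": 0.5, "x": 0.5},
    "strong": {"good": 1.0, "bad": 0.0, "x": 1.0},
}
for ex, mean_pass1, tau in most_suspect(toy, k=1):
    print(f"{ex}\t{mean_pass1:.3f}\t{tau:.3f}")
```

A negative tau means stronger models do worse on the problem. A small positive tau, as in the table above, means the problem only weakly tracks overall model quality.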

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
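The binning behind such a histogram can be sketched as follows, assuming each problem's accuracy is a number in [0, 1]. The bin count, function names, and sample values are illustrative and not taken from the page.

```python
def histogram(values, bins=10):
    """Count values in [0, 1] into equal-width bins; 1.0 falls in the last bin."""
    counts = [0] * bins
    for v in values:
        idx = min(int(v * bins), bins - 1)
        counts[idx] += 1
    return counts


def render(counts):
    """Draw one text row per bin, e.g. '0.00-0.25 | ### 3'."""
    bins = len(counts)
    return "\n".join(
        f"{i / bins:.2f}-{(i + 1) / bins:.2f} | {'#' * c} {c}"
        for i, c in enumerate(counts)
    )


# Toy per-problem accuracies for illustration only.
accuracies = [0.0, 0.004, 0.02, 0.5, 1.0]
print(render(histogram(accuracies, bins=4)))
```

With most problems unsolved or nearly unsolved, the first bin dominates, so a finer bin width or a logarithmic axis may be more informative.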

Histogram of difficulties

Histogram of problems by difficulty, taken as the minimum win rate among models that solve each problem.