leetcode: by examples


Not solved by any model

There are 70 examples that no model solves. Solving some of them can be a good signal that your model is genuinely better than the leading models, provided the problems themselves are sound.
0, 102, 103, 108, 110, 111, 113, 116, 118, 119, 122, 130, 136, 137, 14, 141, 142, 144, 145, 146, 147, 149, 152, 154, 155, 156, 158, 160, 164, 165, 166, 168, 169, 173, 174, 176, 177, 32, 4, 45, 48, 50, 51, 52, 57, 58, 61, 62, 66, 68, 69, 70, 73, 74, 76, 78, 79, 80, 82, 84, 86, 88, 89, 90, 91, 93, 95, 96, 97, 99
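
As a rough illustration of how such a list could be derived, the sketch below scans a per-model results table and keeps the examples that no model solved. The `results` layout (model name mapped to per-example pass/fail) and the toy data are assumptions made for this example, not the format used by the evaluation itself. Ids are sorted as strings, which gives an ordering like the one in the list above (for example, 14 after 137).

```python
from typing import Dict, List


def unsolved_examples(results: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return the ids of examples that no model solved.

    `results` is a hypothetical layout: model name -> {example id -> solved?}.
    An example missing from a model's dict is treated as not solved.
    """
    all_examples = set()
    for per_example in results.values():
        all_examples.update(per_example)
    return sorted(
        ex
        for ex in all_examples
        if not any(per_example.get(ex, False) for per_example in results.values())
    )


# Toy data, made up for illustration only.
toy_results = {
    "model_a": {"0": False, "1": True, "4": False},
    "model_b": {"0": False, "1": False, "4": False},
}
print(unsolved_examples(toy_results))  # ['0', '4']
```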

Problems solved by 1 model only

example_link model min_pass1_of_model
1 google_gemma_3_27b_it 0.402
100 google_gemma_3_27b_it 0.402
104 google_gemma_3_27b_it 0.402
106 google_gemma_3_27b_it 0.402
12 google_gemma_3_27b_it 0.402
124 google_gemma_3_27b_it 0.402
121 google_gemma_3_27b_it 0.402
163 google_gemma_3_27b_it 0.402
159 google_gemma_3_27b_it 0.402
59 google_gemma_3_27b_it 0.402
63 google_gemma_3_27b_it 0.402
167 google_gemma_3_27b_it 0.402
26 google_gemma_3_27b_it 0.402
55 google_gemma_3_27b_it 0.402
47 google_gemma_3_27b_it 0.402
46 google_gemma_3_27b_it 0.402
43 google_gemma_3_27b_it 0.402
71 google_gemma_3_27b_it 0.402
94 google_gemma_3_27b_it 0.402
126 google_gemma_2_27b_it 0.222
117 google_gemma_2_27b_it 0.222
25 google_gemma_2_9b_it 0.150
162 google_gemma_2_9b_it 0.150
22 google_codegemma_1.1_7b_it 0.050
172 qwen2.5-coder-32b-instruct 0.003
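
A table like the one above could be built along the following lines. This is a minimal sketch under two assumptions: that results are available as a model-to-example pass/fail mapping, and that `min_pass1_of_model` is the overall pass@1 of the weakest model solving the example (with a single solver, simply that model's pass@1). The function name and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple


def solved_by_one_model(
    results: Dict[str, Dict[str, bool]],
    pass1: Dict[str, float],
) -> List[Tuple[str, str, float]]:
    """Return (example id, sole solver, solver's overall pass@1) rows.

    Rows are ordered by pass@1 descending, then by example id.
    """
    examples = {ex for per_example in results.values() for ex in per_example}
    rows = []
    for ex in examples:
        solvers = [m for m, per_example in results.items() if per_example.get(ex, False)]
        if len(solvers) == 1:
            rows.append((ex, solvers[0], pass1[solvers[0]]))
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows


# Toy data, made up for illustration only.
toy_results = {
    "strong_model": {"1": True, "2": False, "3": True},
    "weak_model": {"1": False, "2": True, "3": True},
}
toy_pass1 = {"strong_model": 0.4, "weak_model": 0.15}

for row in solved_by_one_model(toy_results, toy_pass1):
    print(*row)
```

Example "3" is solved by both toy models, so it does not appear in the output.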

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (i.e., better models tend to do worse on these).

example_link pass1_of_ex tau
172 0.010 0.145
22 0.004 0.200
162 0.005 0.245
25 0.005 0.245
126 0.010 0.267
117 0.010 0.267
94 0.013 0.278
1 0.020 0.278
55 0.020 0.278
159 0.013 0.278
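
The sketch below shows one way a per-problem correlation of this kind could be computed. It assumes that `tau` is Kendall's tau (here the tau-b variant, which corrects for ties) between each model's overall score and that model's outcome on the problem; the page does not state which variant is used, so treat this as an illustration rather than the actual procedure. The scores and outcomes are made up.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b between two equal-length sequences (O(n^2) pairs)."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2 and y1 == y2:
            continue  # tied in both variables: excluded from every count
        if x1 == x2:
            ties_x_only += 1
        elif y1 == y2:
            ties_y_only += 1
        elif (x1 < x2) == (y1 < y2):
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical overall scores of four models, weakest to strongest.
overall = [0.1, 0.2, 0.3, 0.4]
# Outcome (1 = solved) of each model on two made-up problems.
healthy_problem = [0, 0, 0, 1]  # only the strongest model solves it
suspect_problem = [1, 0, 0, 0]  # only the weakest model solves it

print(round(kendall_tau_b(overall, healthy_problem), 3))
print(round(kendall_tau_b(overall, suspect_problem), 3))
```

Sorting problems by this value in ascending order would surface candidates for manual review first; a low value alone does not prove that a problem is flawed.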

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
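
For readers who want to rebuild this kind of histogram from raw numbers, here is a minimal sketch that bins per-problem accuracies into equal-width buckets. The bin count and the sample accuracies are assumptions chosen for illustration; the binning used for the plot itself is not specified here.

```python
from typing import List, Sequence


def accuracy_histogram(accuracies: Sequence[float], num_bins: int = 10) -> List[int]:
    """Count problems per equal-width accuracy bin over [0, 1].

    An accuracy of exactly 1.0 is placed in the last bin.
    """
    counts = [0] * num_bins
    for acc in accuracies:
        index = min(int(acc * num_bins), num_bins - 1)
        counts[index] += 1
    return counts


# Made-up per-problem accuracies (fraction of models solving each problem).
sample = [0.0, 0.004, 0.01, 0.5, 1.0]
print(accuracy_histogram(sample))  # [3, 0, 0, 0, 0, 1, 0, 0, 0, 1]
```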

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.