leetcode: by examples



Not solved by any model

There are 52 examples that no model solves. If these are good problems, solving some of them is a good signal that your model is indeed better than the leading models.
0, 108, 110, 113, 122, 137, 141, 142, 145, 146, 147, 149, 152, 154, 156, 158, 160, 164, 165, 166, 168, 169, 172, 173, 174, 176, 177, 4, 43, 46, 48, 50, 51, 52, 57, 58, 62, 66, 68, 69, 70, 74, 76, 78, 79, 84, 86, 89, 90, 91, 93, 95
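A minimal sketch of how such a list could be derived, assuming per-model results are available as a mapping from model name to per-example pass@1. The `results` layout and the function name `unsolved_by_all` are hypothetical, chosen for illustration only, and the data below is a toy example rather than the real benchmark output.

```python
def unsolved_by_all(results):
    """Return the sorted example ids whose pass@1 is 0 for every model.

    `results` is a hypothetical layout: {model_name: {example_id: pass1}}.
    An example missing from a model's dict is treated as unsolved by it.
    """
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)
    return sorted(
        ex
        for ex in example_ids
        if all(per_example.get(ex, 0.0) == 0.0 for per_example in results.values())
    )


# Toy data, not the real benchmark results.
toy_results = {
    "model_a": {0: 0.0, 1: 0.5, 2: 0.0},
    "model_b": {0: 0.0, 1: 0.0, 2: 1.0},
}
print(unsolved_by_all(toy_results))
```

Sorting numerically, as this sketch does, gives a different order from the list above, which appears to be sorted as strings.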

Problems solved by 1 model only

example_link model min_pass1_of_model
100 google_gemma_3_27b_it 0.401
104 google_gemma_3_27b_it 0.401
116 google_gemma_3_27b_it 0.401
124 google_gemma_3_27b_it 0.401
121 google_gemma_3_27b_it 0.401
119 google_gemma_3_27b_it 0.401
159 google_gemma_3_27b_it 0.401
151 google_gemma_3_27b_it 0.401
144 google_gemma_3_27b_it 0.401
59 google_gemma_3_27b_it 0.401
47 google_gemma_3_27b_it 0.401
26 google_gemma_3_27b_it 0.401
167 google_gemma_3_27b_it 0.401
163 google_gemma_3_27b_it 0.401
94 google_gemma_3_27b_it 0.401
96 google_gemma_3_27b_it 0.401
97 google_gemma_3_27b_it 0.401
80 google_gemma_3_27b_it 0.401
103 google_gemma_2_27b_it 0.201
99 google_gemma_2_27b_it 0.201
56 google_gemma_2_27b_it 0.201
130 google_gemma_2_27b_it 0.201
82 google_gemma_3_4b_it 0.175
5 google_gemma_3_4b_it 0.175
111 google_gemma_2_9b_it 0.168
73 google_gemma_2_9b_it 0.168
77 google_gemma_3_12b_it 0.070
118 google_gemma_3_12b_it 0.070
22 google_codegemma_1.1_7b_it 0.063
102 qwen2.5-coder-3b-instruct 0.000
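The table above could be produced along these lines. This is a sketch under assumptions: `results` maps model name to per-example pass@1, `overall` maps model name to its overall pass@1, and `min_pass1_of_model` is read as the overall score of the single model that solves the example. None of these names or layouts come from the original report.

```python
def solved_by_one_model(results, overall):
    """List (example_id, model, overall_pass1) for examples with exactly one solver.

    `results`: hypothetical {model_name: {example_id: pass1}}.
    `overall`: hypothetical {model_name: overall_pass1}.
    A model "solves" an example when its pass@1 on it is greater than 0.
    """
    example_ids = {ex for per_example in results.values() for ex in per_example}
    rows = []
    for ex in example_ids:
        solvers = [m for m, per_example in results.items() if per_example.get(ex, 0.0) > 0.0]
        if len(solvers) == 1:
            rows.append((ex, solvers[0], overall[solvers[0]]))
    # Strongest sole solver first, then by example id for a stable order.
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows


# Toy data, not the real benchmark results.
toy_results = {
    "model_a": {0: 0.0, 1: 0.5, 2: 0.2},
    "model_b": {0: 0.0, 1: 0.0, 2: 1.0, 3: 0.1},
}
toy_overall = {"model_a": 0.4, "model_b": 0.2}
for row in solved_by_one_model(toy_results, toy_overall):
    print(*row)
```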

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e. problems where better models do not reliably do better, or even tend to do worse.

example_link pass1_of_ex tau
102 0.002 0.101
22 0.002 0.191
118 0.004 0.201
77 0.002 0.201
73 0.002 0.223
111 0.002 0.223
82 0.002 0.233
5 0.014 0.233
130 0.002 0.254
56 0.008 0.254
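The report does not say how tau is computed. The sketch below assumes it is a Kendall rank correlation (tau-b, which handles ties) between each model's result on one example and that model's overall score. The function names and the `results` / `overall` layouts are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences (pure Python, O(n^2))."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both denominator terms
        if dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x) * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0


def lowest_tau_examples(results, overall, k=10):
    """Return the k (example_id, tau) pairs with the lowest tau.

    `results`: hypothetical {model_name: {example_id: pass1}}.
    `overall`: hypothetical {model_name: overall_pass1}.
    """
    models = sorted(results)
    scores = [overall[m] for m in models]
    example_ids = {ex for per_example in results.values() for ex in per_example}
    taus = []
    for ex in example_ids:
        per_model = [results[m].get(ex, 0.0) for m in models]
        taus.append((ex, kendall_tau_b(per_model, scores)))
    taus.sort(key=lambda row: (row[1], row[0]))
    return taus[:k]
```

Under this reading, a tau near 1 means the example ranks models the same way the whole benchmark does, a tau near 0 means it carries little ranking signal, and a negative tau means stronger models tend to fail it.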

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
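As a rough sketch of how the counts behind such a histogram might be computed, assuming the accuracy of a problem is its pass@1 averaged over all models that report it. The `results` layout, the function name, and the bin count are assumptions for illustration.

```python
def accuracy_histogram(results, bins=10):
    """Count problems per equal-width accuracy bin over [0, 1].

    `results`: hypothetical {model_name: {example_id: pass1}}.
    Accuracy of a problem is taken as the mean pass@1 over the models
    that report a value for it; an accuracy of exactly 1.0 goes in the last bin.
    """
    per_example = {}
    for per_model in results.values():
        for ex, pass1 in per_model.items():
            per_example.setdefault(ex, []).append(pass1)
    counts = [0] * bins
    for scores in per_example.values():
        accuracy = sum(scores) / len(scores)
        index = min(int(accuracy * bins), bins - 1)
        counts[index] += 1
    return counts


# Toy data, not the real benchmark results.
toy_results = {
    "model_a": {0: 0.0, 1: 0.5, 2: 1.0},
    "model_b": {0: 0.0, 1: 0.5, 2: 1.0},
}
print(accuracy_histogram(toy_results, bins=4))
```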

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.
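One possible reading of "minimum win rate to solve" is the lowest overall score among the models that solve a problem, which would match the `min_pass1_of_model` column used earlier. The sketch below follows that reading and uses overall pass@1 as a stand-in for win rate. The names and data layouts are hypothetical.

```python
def min_score_to_solve(results, overall):
    """Map each solved example to the lowest overall score among its solvers.

    `results`: hypothetical {model_name: {example_id: pass1}}.
    `overall`: hypothetical {model_name: overall_score}.
    Examples that no model solves are left out of the returned dict.
    """
    difficulty = {}
    for model, per_example in results.items():
        for ex, pass1 in per_example.items():
            if pass1 > 0.0:
                current = difficulty.get(ex, float("inf"))
                difficulty[ex] = min(current, overall[model])
    return difficulty


# Toy data, not the real benchmark results.
toy_results = {
    "model_a": {0: 0.0, 1: 0.5, 2: 1.0},
    "model_b": {0: 0.0, 1: 0.0, 2: 0.3},
}
toy_overall = {"model_a": 0.4, "model_b": 0.2}
print(min_score_to_solve(toy_results, toy_overall))
```

A problem with a high value here is only solved by strong models, so it sits at the hard end of the histogram. The resulting values can be binned in the same way as any other per-problem score.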