leetcode: by examples


Not solved by any model

There are 52 examples that no model solved. If these are sound problems, solving some of them is a good signal that your model is indeed better than the leading models.
0, 108, 110, 113, 122, 137, 141, 142, 145, 146, 147, 149, 152, 154, 156, 158, 160, 164, 165, 166, 168, 169, 172, 173, 174, 176, 177, 4, 43, 46, 48, 50, 51, 52, 57, 58, 62, 66, 68, 69, 70, 74, 76, 78, 79, 84, 86, 89, 90, 91, 93, 95
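
As a rough sketch of how such a list can be derived, the snippet below collects the examples that no model solved. The `results` layout (model name mapped to a per-example solved flag), the function name, and the toy data are assumptions made for illustration, not the format used to produce this page.

```python
def unsolved_by_all(results):
    """Return the sorted example ids that no model solved.

    `results` is a hypothetical layout: results[model][example_id] is True
    when that model solved the example.
    """
    all_examples = set()
    solved = set()
    for per_example in results.values():
        all_examples.update(per_example)
        solved.update(ex for ex, ok in per_example.items() if ok)
    return sorted(all_examples - solved)


# Toy data, not taken from the leaderboard.
toy_results = {
    "model_a": {0: False, 1: True, 2: False},
    "model_b": {0: False, 1: True, 2: True},
}
print(unsolved_by_all(toy_results))  # [0]
```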

Problems solved by 1 model only

example_link	model	min_pass1_of_model
100	google_gemma_3_27b_it	0.401
104	google_gemma_3_27b_it	0.401
116	google_gemma_3_27b_it	0.401
124	google_gemma_3_27b_it	0.401
121	google_gemma_3_27b_it	0.401
119	google_gemma_3_27b_it	0.401
159	google_gemma_3_27b_it	0.401
151	google_gemma_3_27b_it	0.401
144	google_gemma_3_27b_it	0.401
59	google_gemma_3_27b_it	0.401
47	google_gemma_3_27b_it	0.401
26	google_gemma_3_27b_it	0.401
167	google_gemma_3_27b_it	0.401
163	google_gemma_3_27b_it	0.401
94	google_gemma_3_27b_it	0.401
96	google_gemma_3_27b_it	0.401
97	google_gemma_3_27b_it	0.401
80	google_gemma_3_27b_it	0.401
103	google_gemma_2_27b_it	0.201
99	google_gemma_2_27b_it	0.201
56	google_gemma_2_27b_it	0.201
130	google_gemma_2_27b_it	0.201
82	google_gemma_3_4b_it	0.175
5	google_gemma_3_4b_it	0.175
111	google_gemma_2_9b_it	0.168
73	google_gemma_2_9b_it	0.168
77	google_gemma_3_12b_it	0.070
118	google_gemma_3_12b_it	0.070
22	google_codegemma_1.1_7b_it	0.063
102	qwen2.5-coder-3b-instruct	0.000
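
A table of this shape could be built roughly as follows. This is a sketch that assumes `min_pass1_of_model` is the overall pass@1 of the single model that solved the example, and that results are stored as a hypothetical `results[model][example_id]` boolean mapping; neither assumption is confirmed by the page itself.

```python
def solved_by_one_model(results):
    """Return (example_id, model, pass1_of_model) rows for examples with one solver.

    `results[model][example_id]` is a hypothetical boolean "solved" flag.
    Rows are sorted by the solver's pass@1, highest first.
    """
    solvers = {}
    for model, per_example in results.items():
        for ex, ok in per_example.items():
            if ok:
                solvers.setdefault(ex, []).append(model)
    # Overall pass@1 of each model: fraction of its examples that it solved.
    pass1 = {m: sum(pe.values()) / len(pe) for m, pe in results.items()}
    rows = [(ex, ms[0], pass1[ms[0]]) for ex, ms in solvers.items() if len(ms) == 1]
    return sorted(rows, key=lambda row: -row[2])


# Toy data, not taken from the leaderboard.
toy_results = {
    "model_a": {0: False, 1: True, 2: False, 3: False},
    "model_b": {0: False, 1: True, 2: True, 3: False},
}
print(solved_by_one_model(toy_results))  # [(2, 'model_b', 0.5)]
```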

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (lowest tau), i.e. better models do not reliably do better on these.

example_link	pass1_of_ex	tau
102	0.002	0.101
22	0.002	0.191
118	0.004	0.201
77	0.002	0.201
73	0.002	0.223
111	0.002	0.223
82	0.002	0.233
5	0.014	0.233
130	0.002	0.254
56	0.008	0.254
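
The page does not say which correlation the `tau` column uses. As one plausible reading, the sketch below computes Kendall's tau-b between each model's overall score and its outcome on a single problem; the function name and the toy numbers are illustrative assumptions.

```python
import math
from itertools import combinations


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences, with tie handling."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both denominator terms
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Toy data: overall score of four models and whether each solved one problem.
overall = [0.1, 0.2, 0.3, 0.4]
well_behaved = [0, 0, 1, 1]  # stronger models solve it
suspect = [1, 1, 0, 0]       # weaker models solve it
print(round(kendall_tau_b(overall, well_behaved), 3))  # 0.816
print(round(kendall_tau_b(overall, suspect), 3))       # -0.816
```

A problem that only weak models solve gets a low or negative value under this definition, which is the pattern the table is meant to surface.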

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
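
The binning behind such a histogram can be sketched as below, assuming each problem's accuracy is a value in [0, 1] (for instance the `pass1_of_ex` column). The bin count and the sample values are arbitrary choices for illustration.

```python
def accuracy_histogram(accuracies, num_bins=10):
    """Count problems per equal-width accuracy bin over [0, 1]."""
    counts = [0] * num_bins
    for acc in accuracies:
        # Clamp so that an accuracy of exactly 1.0 lands in the last bin.
        idx = min(int(acc * num_bins), num_bins - 1)
        counts[idx] += 1
    return counts


# Toy accuracies, not the real per-problem values.
print(accuracy_histogram([0.0, 0.002, 0.014, 0.5, 1.0], num_bins=4))  # [3, 0, 1, 1]
```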

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.
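
One way to read "minimum win rate to solve" is the lowest win rate among the models that solved a given problem. The sketch below follows that reading; the `results` and `win_rate` mappings are hypothetical inputs, and how win rate itself is computed is not specified here.

```python
def min_win_rate_to_solve(results, win_rate):
    """Map each solved example to the lowest win rate among models that solved it.

    `results[model][example_id]` is a hypothetical boolean "solved" flag and
    `win_rate[model]` a hypothetical per-model score. Examples that no model
    solved are left out of the returned mapping.
    """
    difficulty = {}
    for model, per_example in results.items():
        for ex, ok in per_example.items():
            if ok:
                best_so_far = difficulty.get(ex, float("inf"))
                difficulty[ex] = min(best_so_far, win_rate[model])
    return difficulty


# Toy data, not taken from the leaderboard.
toy_results = {
    "weak": {0: True, 1: False, 2: False},
    "strong": {0: True, 1: True, 2: False},
}
toy_win_rate = {"weak": 0.2, "strong": 0.8}
print(min_win_rate_to_solve(toy_results, toy_win_rate))  # {0: 0.2, 1: 0.8}
```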