human_eval: by examples


Not solved by any model

There is 1 example not solved by any model. Solving it can be a good signal that your model is indeed better than the leading models, provided it is a good problem.
145
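
As a minimal sketch of how such a list could be derived, assume a table of per-model, per-example pass@1 scores. The `pass1` dictionary, the model names, and the example ids below are hypothetical and not taken from this page; an example counts as unsolved when every model scores 0 on it.

```python
# Hypothetical pass@1 table: model name -> {example id -> pass@1}.
pass1 = {
    "model_a": {"1": 1.0, "2": 0.0, "3": 0.0},
    "model_b": {"1": 0.5, "2": 0.2, "3": 0.0},
}

def unsolved_by_all(pass1):
    """Return the example ids whose pass@1 is 0 for every model."""
    examples = sorted({ex for scores in pass1.values() for ex in scores})
    return [
        ex for ex in examples
        if all(scores.get(ex, 0.0) == 0.0 for scores in pass1.values())
    ]

unsolved = unsolved_by_all(pass1)
print(unsolved)  # ['3']
```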

Problems solved by 1 model only

example_link model min_pass1_of_model
132 qwen3-14b 0.860
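
A sketch of how this table might be built, under two assumptions that the page does not confirm: a problem is "solved" by a model when its pass@1 on that problem is above 0, and `min_pass1_of_model` is the overall pass@1 of the weakest model that solves the problem. The data and names are hypothetical.

```python
# Hypothetical pass@1 table: model name -> {example id -> pass@1}.
pass1 = {
    "model_a": {"1": 1.0, "2": 1.0, "3": 0.0},
    "model_b": {"1": 1.0, "2": 0.0, "3": 0.0},
}

def solved_by_one(pass1):
    """Return (example, model, overall pass@1 of that model) for each
    example that exactly one model solves."""
    # Assumed definition: overall pass@1 is the mean over a model's examples.
    overall = {m: sum(s.values()) / len(s) for m, s in pass1.items()}
    examples = sorted({ex for s in pass1.values() for ex in s})
    rows = []
    for ex in examples:
        solvers = [m for m, s in pass1.items() if s.get(ex, 0.0) > 0.0]
        if len(solvers) == 1:
            rows.append((ex, solvers[0], round(overall[solvers[0]], 3)))
    return rows

rows = solved_by_one(pass1)
print(rows)  # [('2', 'model_a', 0.667)]
```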

Suspect problems

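These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e. problems where a model's overall score is a weak predictor of whether it solves them. A negative tau would mean that better models tend to do worse on the problem.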

example_link pass1_of_ex tau
127 0.153 0.090
132 0.002 0.182
83 0.061 0.231
121 0.501 0.232
160 0.086 0.235
54 0.421 0.249
65 0.130 0.285
35 0.766 0.288
163 0.028 0.339
116 0.446 0.341
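
The page does not say which correlation `tau` is. Assuming it is a Kendall rank correlation between each model's overall score and its pass@1 on one problem, a minimal sketch looks like this. It implements the simple tau-a variant (ties count as neither concordant nor discordant), and the scores are hypothetical.

```python
from itertools import combinations

def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(range(len(xs)), 2))
    score = 0
    for i, j in pairs:
        prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
        # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie.
        score += (prod > 0) - (prod < 0)
    return score / len(pairs)

# Hypothetical data: one entry per model.
overall = [0.5, 0.6, 0.7, 0.8]   # overall benchmark score of each model
aligned = [0.1, 0.2, 0.3, 0.4]   # pass@1 on a problem that tracks overall skill
inverted = [0.9, 0.4, 0.3, 0.1]  # pass@1 on a problem better models do worse on

tau_aligned = kendall_tau_a(overall, aligned)
tau_inverted = kendall_tau_a(overall, inverted)
print(tau_aligned, tau_inverted)  # 1.0 -1.0
```

Under this reading, a problem with tau near 0 says little about overall model quality, which is why low-tau problems are flagged as suspect.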

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
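
As an illustration of the binning behind such a histogram, the sketch below counts problems per accuracy bin. It uses only the ten `pass1_of_ex` values listed in the suspect-problems table, so it does not reproduce the full histogram; the bin count of 5 is an arbitrary choice.

```python
def histogram(values, bins=5):
    """Count values in [0, 1] into equal-width bins; 1.0 goes in the last bin."""
    counts = [0] * bins
    for v in values:
        idx = min(int(v * bins), bins - 1)
        counts[idx] += 1
    return counts

# pass1_of_ex values of the ten suspect problems listed above.
accuracies = [0.153, 0.002, 0.061, 0.501, 0.086,
              0.421, 0.130, 0.766, 0.028, 0.446]

counts = histogram(accuracies)
print(counts)  # [6, 0, 3, 1, 0]
```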

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.
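
The page does not define "win rate" or how the minimum is taken. One possible reading, sketched below with hypothetical models and numbers, is that each model has a single win rate against the other models and that a problem's difficulty is the lowest win rate among the models that solve it.

```python
# Hypothetical inputs: per-example pass@1 and one win rate per model.
pass1 = {
    "weak": {"1": 1.0, "2": 0.0, "3": 0.0},
    "strong": {"1": 1.0, "2": 0.5, "3": 0.0},
}
win_rate = {"weak": 0.3, "strong": 0.7}

def min_win_rate_to_solve(pass1, win_rate):
    """Map each example to the lowest win rate among models that solve it
    (pass@1 > 0), or None when no model solves it."""
    examples = sorted({ex for s in pass1.values() for ex in s})
    result = {}
    for ex in examples:
        solvers = [win_rate[m] for m, s in pass1.items() if s.get(ex, 0.0) > 0.0]
        result[ex] = min(solvers) if solvers else None
    return result

difficulty = min_win_rate_to_solve(pass1, win_rate)
print(difficulty)  # {'1': 0.3, '2': 0.7, '3': None}
```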