There are 112 examples not solved by any model.
If these are well-posed problems, solving some of them is a good signal that your model really is better than the leading models.
DS/106, DS/107, DS/108, DS/121, DS/122, DS/132, DS/142, DS/15, DS/159, DS/165, DS/173, DS/174, DS/197, DS/202, DS/203, DS/204, DS/205, DS/208, DS/209, DS/210, DS/211, DS/216, DS/225, DS/228, DS/240, DS/242, DS/245, DS/269, DS/270, DS/272, DS/284, DS/285, DS/286, DS/29, DS/318, DS/319, DS/328, DS/339, DS/345, DS/354, DS/372, DS/375, DS/385, DS/387, DS/389, DS/390, DS/394, DS/40, DS/407, DS/408, DS/410, DS/42, DS/420, DS/421, DS/43, DS/439, DS/44, DS/45, DS/46, DS/468, DS/509, DS/516, DS/521, DS/526, DS/54, DS/57, DS/58, DS/59, DS/596, DS/60, DS/612, DS/626, DS/638, DS/65, DS/672, DS/699, DS/701, DS/726, DS/73, DS/74, DS/747, DS/749, DS/75, DS/750, DS/751, DS/755, DS/773, DS/779, DS/780, DS/789, DS/790, DS/798, DS/80, DS/808, DS/809, DS/81, DS/86, DS/877, DS/879, DS/88, DS/883, DS/884, DS/885, DS/9, DS/900, DS/901, DS/904, DS/905, DS/927, DS/96, DS/987, DS/993
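
A list like the one above can be derived from a per-model table of outcomes. The following is a minimal sketch, assuming results are stored as a mapping from model name to per-example pass/fail flags; the `results` layout, the model names, and the helper `unsolved_examples` are hypothetical and not taken from the report's own tooling.

```python
# Hypothetical layout: model name -> {example id -> solved?}
results = {
    "model_a": {"DS/1": True, "DS/2": False, "DS/3": False},
    "model_b": {"DS/1": True, "DS/2": True, "DS/3": False},
}


def unsolved_examples(results):
    """Return the sorted ids of examples that no model solved."""
    all_ids = set()
    for per_example in results.values():
        all_ids.update(per_example)
    return sorted(
        ex
        for ex in all_ids
        # A missing entry is treated as "not solved" (an assumption).
        if not any(per_example.get(ex, False) for per_example in results.values())
    )


unsolved = unsolved_examples(results)
print(len(unsolved), unsolved)
```

Checking the length of the returned list against the stated count is a cheap consistency test for a report of this kind.
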
| example_link | model | min_pass1_of_model |
|---|---|---|
| DS/105 | claude-3-5-sonnet-20240620 | 0.543 |
| DS/458 | claude-3-5-sonnet-20240620 | 0.543 |
| DS/6 | claude-3-5-sonnet-20240620 | 0.543 |
| DS/488 | claude-3-5-sonnet-20240620 | 0.543 |
| DS/79 | claude-3-5-sonnet-20240620 | 0.543 |
| DS/984 | claude-3-5-sonnet-20240620 | 0.543 |
| DS/922 | claude-3-5-sonnet-20240620 | 0.543 |
| DS/926 | claude-3-5-sonnet-20240620 | 0.543 |
| DS/744 | claude-3-5-sonnet-20240620 | 0.543 |
| DS/505 | gpt-4-turbo-2024-04-09 | 0.540 |
| DS/253 | gpt-4-turbo-2024-04-09 | 0.540 |
| DS/304 | gpt-4-turbo-2024-04-09 | 0.540 |
| DS/280 | deepseek-ai-deepseek-coder-V2-SFT | 0.532 |
| DS/56 | deepseek-ai-deepseek-coder-V2-SFT | 0.532 |
| DS/131 | Qwen-Qwen2-72B-Instruct | 0.528 |
| DS/244 | Qwen-Qwen2-72B-Instruct | 0.528 |
| DS/7 | Qwen-Qwen2-72B-Instruct | 0.528 |
| DS/418 | mistralai-Codestral-22B-v0.1 | 0.512 |
| DS/39 | gpt-4-0613 | 0.510 |
| DS/154 | gpt-4-0613 | 0.510 |
| DS/765 | gpt-4-0613 | 0.510 |
| DS/807 | gpt-4-0613 | 0.510 |
| DS/784 | gpt-4-0613 | 0.510 |
| DS/362 | meta-llama-Llama-3-70b-chat-hf | 0.486 |
| DS/346 | meta-llama-Llama-3-70b-chat-hf | 0.486 |
| DS/776 | meta-llama-Llama-3-70b-chat-hf | 0.486 |
| DS/781 | meta-llama-Llama-3-70b-chat-hf | 0.486 |
| DS/347 | meta-llama-Llama-3-70b-chat-hf | 0.486 |
| DS/772 | meta-llama-Llama-3-70b-chat-hf | 0.486 |
| DS/373 | deepseek-ai-deepseek-coder-V2-Base | 0.467 |
| DS/679 | microsoft-wavecoder-ultra-6.7b | 0.460 |
| DS/515 | meta-llama-Llama-3-70B | 0.409 |
| DS/26 | deepseek-ai-deepseek-llm-67b-chat | 0.407 |
| DS/799 | Phind-Phind-CodeLlama-34B-v2 | 0.404 |
| DS/447 | m-a-p-OpenCodeInterpreter-CL-7B | 0.395 |
| DS/87 | gpt-3.5-turbo-0125 | 0.394 |
| DS/766 | m-a-p-OpenCodeInterpreter-SC2-7B | 0.389 |
| DS/604 | m-a-p-OpenCodeInterpreter-SC2-7B | 0.389 |
| DS/998 | codellama-CodeLlama-34b-Python-hf | 0.389 |
| DS/411 | codellama-CodeLlama-70b-Python-hf | 0.389 |
| DS/997 | codellama-CodeLlama-34b-Python-hf | 0.389 |
| DS/813 | gpt-3.5-turbo-0613 | 0.386 |
| DS/172 | deepseek-ai-deepseek-V2-chat | 0.385 |
| DS/953 | microsoft-Phi-3-small-8k-instruct | 0.377 |
| DS/775 | WizardLM-WizardCoder-Python-34B-V1.0 | 0.367 |
| DS/67 | Qwen-Qwen1.5-72B-Chat | 0.355 |
| DS/90 | Qwen-Qwen1.5-72B-Chat | 0.355 |
| DS/263 | ibm-granite-granite-34b-code-base | 0.348 |
| DS/899 | meta-llama-Llama-3-8B | 0.315 |
| DS/161 | codellama-CodeLlama-7b-hf | 0.229 |
| DS/64 | ERNIE-Speed-8K | 0.088 |
| DS/523 | google-gemma-1.1-2b-it | 0.085 |
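
The column name `min_pass1_of_model` suggests that each row pairs an example with the lowest-scoring model that solved it, together with that model's overall pass@1. Under that reading (an assumption, since the report does not define the column), the table could be computed roughly as follows. The data layout and the helper `weakest_solver` are hypothetical.

```python
# Hypothetical layout: model name -> {example id -> solved?}
results = {
    "strong": {"DS/1": True, "DS/2": True, "DS/3": True, "DS/4": False},
    "weak": {"DS/1": True, "DS/2": False, "DS/3": False, "DS/4": False},
}


def weakest_solver(results):
    """Map each solved example to (model, pass1) of the lowest-scoring solver."""
    # Overall pass@1 of a model: fraction of examples it solved.
    pass1 = {m: sum(r.values()) / len(r) for m, r in results.items()}
    weakest = {}
    for model, per_example in results.items():
        for ex, solved in per_example.items():
            if solved and (ex not in weakest or pass1[model] < weakest[ex][1]):
                weakest[ex] = (model, pass1[model])
    return weakest


rows = weakest_solver(results)
# Sort by the solver's pass@1, highest first, as the table above appears to be.
for ex, (model, score) in sorted(rows.items(), key=lambda kv: -kv[1][1]):
    print(f"| {ex} | {model} | {score:.3f} |")
```

Examples that no model solved never enter the mapping, so they would be reported separately.
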
These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e. better models tend to do worse on them.
| example_link | pass1_of_ex | tau |
|---|---|---|
| DS/523 | 0.010 | -0.109 |
| DS/64 | 0.010 | -0.104 |
| DS/585 | 0.019 | -0.102 |
| DS/880 | 0.210 | -0.096 |
| DS/611 | 0.314 | -0.094 |
| DS/762 | 0.190 | -0.089 |
| DS/250 | 0.019 | -0.076 |
| DS/161 | 0.010 | -0.043 |
| DS/514 | 0.076 | -0.039 |
| DS/882 | 0.114 | -0.030 |
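
The report does not say how `tau` is computed. One plausible reading is Kendall's tau-b between each model's overall pass@1 and whether that model solved the example. The sketch below implements that under this assumption, using only the standard library; the function name and the sample numbers are illustrative, not taken from the report.

```python
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b for two equal-length sequences, O(n^2) pair scan."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from every count
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical example: four models ordered by overall pass@1,
# where only the two weakest models solve the problem.
model_pass1 = [0.1, 0.3, 0.5, 0.7]
solved = [1, 1, 0, 0]
tau = kendall_tau_b(model_pass1, solved)
print(round(tau, 3))
```

A negative value here corresponds to the situation described above: models with higher overall scores fail the problem more often than weaker ones.
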
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum win rate to solve each problem.