DS1000: by examples


Not solved by any model

There are 112 examples that no model solved. If these are sound problems, solving some of them is a good signal that your model is genuinely better than the leading models.
DS/106, DS/107, DS/108, DS/121, DS/122, DS/132, DS/142, DS/15, DS/159, DS/165, DS/173, DS/174, DS/197, DS/202, DS/203, DS/204, DS/205, DS/208, DS/209, DS/210, DS/211, DS/216, DS/225, DS/228, DS/240, DS/242, DS/245, DS/269, DS/270, DS/272, DS/284, DS/285, DS/286, DS/29, DS/318, DS/319, DS/328, DS/339, DS/345, DS/354, DS/372, DS/375, DS/385, DS/387, DS/389, DS/390, DS/394, DS/40, DS/407, DS/408, DS/410, DS/42, DS/420, DS/421, DS/43, DS/439, DS/44, DS/45, DS/46, DS/468, DS/509, DS/516, DS/521, DS/526, DS/54, DS/57, DS/58, DS/59, DS/596, DS/60, DS/612, DS/626, DS/638, DS/65, DS/672, DS/699, DS/701, DS/726, DS/73, DS/74, DS/747, DS/749, DS/75, DS/750, DS/751, DS/755, DS/773, DS/779, DS/780, DS/789, DS/790, DS/798, DS/80, DS/808, DS/809, DS/81, DS/86, DS/877, DS/879, DS/88, DS/883, DS/884, DS/885, DS/9, DS/900, DS/901, DS/904, DS/905, DS/927, DS/96, DS/987, DS/993
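
A list like this can be derived from per-model outcomes. The following is a minimal sketch, assuming the results are available as a mapping from model name to per-problem pass/fail; the `results` data and the `unsolved_by_all` helper are hypothetical and not taken from the benchmark's code.

```python
# Hypothetical results: model name -> {problem id -> True if solved}.
results = {
    "model_a": {"DS/1": True, "DS/2": False, "DS/3": False},
    "model_b": {"DS/1": True, "DS/2": True, "DS/3": False},
}


def unsolved_by_all(results):
    """Return the sorted ids of problems that no model solved."""
    problems = set()
    for outcomes in results.values():
        problems.update(outcomes)
    # A problem missing from a model's outcomes is treated as unsolved.
    return sorted(
        p for p in problems
        if not any(outcomes.get(p, False) for outcomes in results.values())
    )


print(unsolved_by_all(results))
```

The ids are sorted as strings, which appears to match the ordering of the list above (for example, DS/142 before DS/15).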

Problems solved by 1 model only

example_link	model	min_pass1_of_model
DS/105	claude-3-5-sonnet-20240620	0.543
DS/458	claude-3-5-sonnet-20240620	0.543
DS/6	claude-3-5-sonnet-20240620	0.543
DS/488	claude-3-5-sonnet-20240620	0.543
DS/79	claude-3-5-sonnet-20240620	0.543
DS/984	claude-3-5-sonnet-20240620	0.543
DS/922	claude-3-5-sonnet-20240620	0.543
DS/926	claude-3-5-sonnet-20240620	0.543
DS/744	claude-3-5-sonnet-20240620	0.543
DS/505	gpt-4-turbo-2024-04-09	0.540
DS/253	gpt-4-turbo-2024-04-09	0.540
DS/304	gpt-4-turbo-2024-04-09	0.540
DS/280	deepseek-ai-deepseek-coder-V2-SFT	0.532
DS/56	deepseek-ai-deepseek-coder-V2-SFT	0.532
DS/131	Qwen-Qwen2-72B-Instruct	0.528
DS/244	Qwen-Qwen2-72B-Instruct	0.528
DS/7	Qwen-Qwen2-72B-Instruct	0.528
DS/418	mistralai-Codestral-22B-v0.1	0.512
DS/39	gpt-4-0613	0.510
DS/154	gpt-4-0613	0.510
DS/765	gpt-4-0613	0.510
DS/807	gpt-4-0613	0.510
DS/784	gpt-4-0613	0.510
DS/362	meta-llama-Llama-3-70b-chat-hf	0.486
DS/346	meta-llama-Llama-3-70b-chat-hf	0.486
DS/776	meta-llama-Llama-3-70b-chat-hf	0.486
DS/781	meta-llama-Llama-3-70b-chat-hf	0.486
DS/347	meta-llama-Llama-3-70b-chat-hf	0.486
DS/772	meta-llama-Llama-3-70b-chat-hf	0.486
DS/373	deepseek-ai-deepseek-coder-V2-Base	0.467
DS/679	microsoft-wavecoder-ultra-6.7b	0.460
DS/515	meta-llama-Llama-3-70B	0.409
DS/26	deepseek-ai-deepseek-llm-67b-chat	0.407
DS/799	Phind-Phind-CodeLlama-34B-v2	0.404
DS/447	m-a-p-OpenCodeInterpreter-CL-7B	0.395
DS/87	gpt-3.5-turbo-0125	0.394
DS/766	m-a-p-OpenCodeInterpreter-SC2-7B	0.389
DS/604	m-a-p-OpenCodeInterpreter-SC2-7B	0.389
DS/998	codellama-CodeLlama-34b-Python-hf	0.389
DS/411	codellama-CodeLlama-70b-Python-hf	0.389
DS/997	codellama-CodeLlama-34b-Python-hf	0.389
DS/813	gpt-3.5-turbo-0613	0.386
DS/172	deepseek-ai-deepseek-V2-chat	0.385
DS/953	microsoft-Phi-3-small-8k-instruct	0.377
DS/775	WizardLM-WizardCoder-Python-34B-V1.0	0.367
DS/67	Qwen-Qwen1.5-72B-Chat	0.355
DS/90	Qwen-Qwen1.5-72B-Chat	0.355
DS/263	ibm-granite-granite-34b-code-base	0.348
DS/899	meta-llama-Llama-3-8B	0.315
DS/161	codellama-CodeLlama-7b-hf	0.229
DS/64	ERNIE-Speed-8K	0.088
DS/523	google-gemma-1.1-2b-it	0.085
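
A table of this shape could be built as sketched below. This assumes `min_pass1_of_model` is the overall pass@1 of the lowest-scoring model that solves the problem, which for a single-solver problem is simply that model's own score. The data and the `solved_by_one` helper are hypothetical.

```python
# Hypothetical results: model name -> {problem id -> True if solved}.
results = {
    "strong": {"DS/1": True, "DS/2": True, "DS/3": True, "DS/4": False},
    "weak": {"DS/1": True, "DS/2": False, "DS/3": False, "DS/4": True},
}


def solved_by_one(results):
    """Return (problem, model, model_pass1) rows for single-solver problems."""
    # Overall pass@1 of each model: the fraction of problems it solved.
    pass1 = {m: sum(o.values()) / len(o) for m, o in results.items()}
    problems = set()
    for outcomes in results.values():
        problems.update(outcomes)
    rows = []
    for problem in problems:
        solvers = [m for m, o in results.items() if o.get(problem, False)]
        if len(solvers) == 1:
            rows.append((problem, solvers[0], round(pass1[solvers[0]], 3)))
    # Strongest models first, then by problem id for a stable order.
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows


rows = solved_by_one(results)
for problem, model, score in rows:
    print(f"{problem}\t{model}\t{score:.3f}")
```

Rows near the bottom of such a table deserve a closer look: a problem solved only by a weak model may be a lucky pass rather than evidence of skill.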

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (i.e., better models tend to do worse on them).

example_link	pass1_of_ex	tau
DS/523	0.010	-0.109
DS/64	0.010	-0.104
DS/585	0.019	-0.102
DS/880	0.210	-0.096
DS/611	0.314	-0.094
DS/762	0.190	-0.089
DS/250	0.019	-0.076
DS/161	0.010	-0.043
DS/514	0.076	-0.039
DS/882	0.114	-0.030
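
The page does not state how `tau` is computed. As an illustration only, the sketch below assumes a Kendall-style rank correlation (the tau-a variant, which ignores ties in the numerator) between each model's overall pass@1 and whether that model solved the problem. The function name and the data are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Tied pairs (sign == 0) count toward neither total.
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical data: four models ordered by overall pass@1.
model_pass1 = [0.2, 0.4, 0.6, 0.8]
# 1 if the model solved the problem; here only the weaker models did.
solved_suspect = [1, 1, 0, 0]

tau_suspect = kendall_tau_a(model_pass1, solved_suspect)
print(tau_suspect)  # negative: better models do worse on this problem
```

Under this assumption, a negative value means the problem's outcomes run against the overall ranking, which is what flags it as suspect.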

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
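
The counts behind such a histogram can be sketched as follows, assuming the accuracy of a problem is the fraction of models that solve it (the `pass1_of_ex` column above) and using ten equal-width bins. The bin count, the function name, and the data are assumptions for illustration.

```python
def accuracy_histogram(accuracies, n_bins=10):
    """Count problems per equal-width accuracy bin over [0, 1]."""
    counts = [0] * n_bins
    for acc in accuracies:
        # An accuracy of exactly 1.0 falls into the last bin.
        index = min(int(acc * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical per-problem accuracies.
example_accuracies = [0.0, 0.05, 0.5, 1.0]
histogram = accuracy_histogram(example_accuracies)
print(histogram)
```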

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.