gsm8k_plus_cot: by examples

Not solved by any model

There are 82 examples that no model solves. If these are well-posed problems, solving some of them is a good signal that your model really does improve on the leading models. The example ids are listed below, followed by a sketch of how this set can be derived from per-example results.
10380, 1056, 1416, 1444, 1585, 1611, 164, 1819, 1932, 2122, 2344, 2507, 2564, 2692, 2778, 2866, 3067, 3112, 3148, 3275, 3377, 3409, 3464, 3610, 3637, 3638, 3904, 3954, 3961, 3973, 4033, 4222, 4401, 442, 4432, 450, 4587, 4722, 4725, 4726, 4884, 4892, 4963, 502, 5120, 5232, 5297, 5403, 5788, 5913, 6312, 6467, 6537, 6779, 6809, 6844, 690, 696, 7012, 7233, 7356, 7552, 7705, 7995, 8011, 8013, 8040, 8128, 8147, 8336, 8337, 8341, 8386, 8387, 843, 8562, 8706, 9096, 936, 9404, 980, 9873
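A minimal sketch of how this set could be reproduced, assuming the per-example results are available as a pandas DataFrame with hypothetical columns example, model, and pass1 (the model's pass@1 on that example). These names are illustrative and not the actual pipeline's schema.

```python
import pandas as pd


def unsolved_examples(results: pd.DataFrame) -> list:
    """Return example ids whose pass@1 is 0 for every model.

    Assumes hypothetical columns: "example", "model", "pass1".
    """
    max_pass1 = results.groupby("example")["pass1"].max()
    return sorted(max_pass1[max_pass1 == 0].index.tolist())


# Toy usage: example 1 is unsolved by both models, example 2 is not.
toy = pd.DataFrame({
    "example": [1, 1, 2, 2],
    "model":   ["a", "b", "a", "b"],
    "pass1":   [0.0, 0.0, 0.5, 0.0],
})
print(unsolved_examples(toy))  # -> [1]
```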

Problems solved by only one model

example_link model min_pass1_of_model
7887 deepseek_r1_distill_qwen_32b 0.751
3507 deepseek_r1_distill_qwen_14b 0.745
1972 deepseek_r1_distill_qwen_7b 0.733
6262 qwen2.5-coder-14b-instruct 0.709
10288 mistralai_mixtral_8x22b_instruct_v0.1 0.696
5052 mistralai_mixtral_8x22b_instruct_v0.1 0.696
5512 qwen1.5-32b-chat 0.671
2121 deepseek_r1_distill_llama_8b 0.668
2748 qwen2-math-7b-instruct 0.655
8531 qwen2-math-7b-instruct 0.655
3281 mistralai_ministral_8b_instruct_2410 0.646
9329 google_gemma_3_4b_it 0.643
4720 qwen2.5-coder-7b-instruct 0.614
7619 qwen2.5-coder-7b-instruct 0.614
1816 mistralai_mathstral_7b_v0.1 0.577
6865 qwen1.5-14b-chat 0.557
5059 qwen1.5-14b-chat 0.557
8027 qwen1.5-14b-chat 0.557
4148 qwen1.5-14b-chat 0.557
2504 deepseek_v2_lite_chat 0.499
9635 deepseek_v2_lite_chat 0.499
9276 mistralai_mixtral_8x7b_instruct_v0.1 0.492
6035 qwen1.5-7b-chat 0.430
7120 google_codegemma_1.1_7b_it 0.375
4890 google_codegemma_1.1_7b_it 0.375
8092 mistralai_mistral_7b_instruct_v0.3 0.363
1364 mistralai_mistral_7b_instruct_v0.3 0.363
8843 qwen2.5-coder-1.5b-instruct 0.349
6640 qwen2.5-coder-1.5b-instruct 0.349
4275 qwen3-0.6b 0.300
7236 qwen3-0.6b 0.300
6897 mistralai_mistral_7b_instruct_v0.2 0.298
2515 mistralai_mistral_7b_instruct_v0.2 0.298
6401 mistralai_mistral_7b_instruct_v0.2 0.298
2376 google_gemma_3_1b_it 0.296
2343 qwen2-1.5b-instruct 0.258
8014 qwen2-1.5b-instruct 0.258
5928 qwen2-1.5b-instruct 0.258
8131 qwen2-1.5b-instruct 0.258
1168 llama-3.2-1B-instruct 0.243
10051 google_gemma_7b_it 0.191
5796 google_gemma_7b_it 0.191
1658 google_gemma_7b_it 0.191
9868 qwen1.5-1.8b-chat 0.151
1167 qwen1.5-1.8b-chat 0.151
2595 qwen2-0.5b-instruct 0.116
3847 qwen2-0.5b-instruct 0.116
8012 qwen2-0.5b-instruct 0.116
8008 qwen2-0.5b-instruct 0.116
6311 qwen2.5-coder-0.5b-instruct 0.090
1388 qwen2.5-coder-0.5b-instruct 0.090
3193 google_gemma_2b_it 0.063
2867 qwen1.5-0.5b-chat 0.043
1209 qwen1.5-0.5b-chat 0.043
6036 qwen1.5-0.5b-chat 0.043
7337 qwen1.5-0.5b-chat 0.043
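A similar sketch for this table, under the same hypothetical schema. Reading min_pass1_of_model as the solving model's overall pass@1 (its mean pass@1 over all examples) is an assumption, not something the table states.

```python
import pandas as pd


def solved_by_one_model(results: pd.DataFrame) -> pd.DataFrame:
    """Examples solved by exactly one model, with that model's overall pass@1."""
    solved = results[results["pass1"] > 0]
    counts = solved.groupby("example")["model"].nunique()
    unique_examples = counts[counts == 1].index
    # Overall pass@1 per model, taken here as the mean over all examples
    # (an assumption about the metric behind min_pass1_of_model).
    overall = results.groupby("model")["pass1"].mean()
    table = solved[solved["example"].isin(unique_examples)].copy()
    table["model_pass1"] = table["model"].map(overall)
    return (table[["example", "model", "model_pass1"]]
            .sort_values("model_pass1", ascending=False))
```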

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on them). A sketch of one way to compute this correlation follows the table.

example_link pass1_of_ex tau
717 0.203 -0.578
9252 0.066 -0.535
369 0.057 -0.519
3810 0.059 -0.478
7068 0.302 -0.467
9301 0.241 -0.459
5997 0.052 -0.453
4347 0.028 -0.443
4100 0.069 -0.429
7449 0.021 -0.425
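One way this could be computed, assuming tau is Kendall's tau between each model's overall pass@1 and its pass@1 on the given example (an assumption about the metric). A strongly negative tau means stronger models tend to fail that example. Column names follow the hypothetical schema used above.

```python
import pandas as pd
from scipy.stats import kendalltau


def suspect_problems(results: pd.DataFrame, k: int = 10) -> pd.DataFrame:
    """Return the k examples whose per-example performance correlates
    most negatively with overall model strength."""
    overall = results.groupby("model")["pass1"].mean()
    rows = []
    for example, group in results.groupby("example"):
        # Rank-correlate overall model strength with performance on this example.
        tau, _ = kendalltau(group["model"].map(overall), group["pass1"])
        rows.append({
            "example": example,
            "pass1_of_ex": group["pass1"].mean(),
            "tau": tau,
        })
    return pd.DataFrame(rows).sort_values("tau").head(k)
```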

Histogram of accuracies

Histogram of problems, binned by per-problem accuracy.
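A plotting sketch, assuming per-problem accuracy is the mean pass@1 over models (hypothetical schema as above).

```python
import matplotlib.pyplot as plt


def plot_accuracy_histogram(results):
    # Per-problem accuracy, assumed here to be the mean pass@1 over models.
    per_example_acc = results.groupby("example")["pass1"].mean()
    plt.hist(per_example_acc, bins=20, range=(0.0, 1.0))
    plt.xlabel("accuracy (mean pass@1 over models)")
    plt.ylabel("number of problems")
    plt.title("Histogram of accuracies")
    plt.show()
```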

Histogram of difficulties

Histogram of problems, binned by the minimum win rate required to solve each problem.
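A sketch of the difficulty measure, assuming "win rate" is a per-model overall win rate supplied separately and that a problem's difficulty is the lowest win rate among the models that solve it. Both readings are assumptions about how this report defines the quantity.

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_difficulty_histogram(results: pd.DataFrame, win_rate: dict):
    """Histogram of per-problem difficulty.

    win_rate is a hypothetical mapping from model name to its overall win rate.
    """
    solved = results[results["pass1"] > 0].copy()
    solved["win_rate"] = solved["model"].map(win_rate)
    # Difficulty: the weakest model (by win rate) that still solves the problem.
    min_win_rate = solved.groupby("example")["win_rate"].min()
    plt.hist(min_win_rate, bins=20, range=(0.0, 1.0))
    plt.xlabel("minimum win rate needed to solve")
    plt.ylabel("number of problems")
    plt.title("Histogram of difficulties")
    plt.show()
```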