gsm8k_plus_cot: by examples

Not solved by any model

There are 156 examples that no model solves. If these are well-posed problems, solving some of them is a good signal that your model genuinely outperforms the leading models. A sketch of how this set can be computed follows the ID list below.
10155, 10380, 10497, 1131, 1209, 1388, 1416, 1444, 1465, 1585, 1611, 164, 1641, 1658, 1675, 1932, 2044, 2121, 2122, 2344, 2504, 2507, 2515, 2564, 2692, 2729, 2732, 2769, 2778, 2866, 2889, 291, 2949, 3043, 3067, 3112, 3138, 3148, 3179, 3275, 3276, 3281, 3377, 3386, 3409, 3507, 3576, 3582, 3610, 3637, 3638, 380, 3818, 3953, 3954, 3961, 3973, 4033, 4211, 4222, 4275, 4348, 4401, 442, 4432, 4434, 4443, 4548, 4587, 4720, 4722, 4725, 4884, 4888, 4890, 4892, 4923, 4963, 499, 501, 5120, 5132, 5198, 5232, 5258, 5281, 5338, 5403, 5657, 5788, 5796, 6035, 6036, 6241, 6388, 6401, 6467, 6499, 6537, 6590, 6640, 6779, 6809, 682, 6844, 6865, 6897, 690, 696, 699, 7012, 7120, 7148, 7232, 7233, 7337, 7356, 7440, 7552, 7564, 7611, 7619, 7705, 786, 7995, 8008, 8009, 8011, 8012, 8027, 8040, 8128, 8131, 8147, 8336, 8337, 8386, 8387, 8389, 843, 8531, 8562, 8706, 8708, 8843, 9096, 9276, 9311, 9329, 9404, 9522, 958, 9627, 9635, 980, 9873
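The report does not show how this set is derived; the following is a minimal sketch, assuming the per-example pass@1 scores are available as a model × example matrix (the file name and layout are hypothetical, not part of this report).

```python
import pandas as pd

# Hypothetical layout: rows = models, columns = example IDs,
# values = that model's pass@1 on that example.
results = pd.read_csv("per_example_pass1.csv", index_col=0)

# An example counts as "not solved by any model" when every model's pass@1 on it is 0.
never_solved = (results == 0).all(axis=0)
unsolved = never_solved[never_solved].index.tolist()

print(f"{len(unsolved)} examples not solved by any model")
print(", ".join(sorted(map(str, unsolved))))
```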

Problems solved by only one model

These examples were solved by exactly one model. Each row lists the example, the model that solved it, and that model's pass@1 score; a sketch of this computation follows the table.

example_link model min_pass1_of_model
2371 qwen3-32b 0.814
5405 qwen3-14b 0.811
6312 llama-3.1-70B-instruct 0.773
8014 llama-3.1-70B-instruct 0.773
4497 llama-3.1-70B-instruct 0.773
2138 google_gemma_3_27b_it 0.751
3958 google_gemma_3_27b_it 0.751
2950 qwen2-72b-instruct 0.744
467 qwen2-72b-instruct 0.744
9868 qwen2-72b-instruct 0.744
3427 qwen2-72b-instruct 0.744
7482 qwen3-4b 0.737
3243 deepseek_r1_distill_qwen_14b 0.710
5512 deepseek_r1_distill_qwen_14b 0.710
450 google_gemma_2_9b_it 0.701
9539 deepseek_r1_distill_qwen_7b 0.686
1972 deepseek_r1_distill_qwen_7b 0.686
4723 qwen2.5-coder-14b-instruct 0.668
6262 qwen2.5-coder-14b-instruct 0.668
956 qwen2-math-72b-instruct 0.663
2139 qwen2-math-72b-instruct 0.663
4726 qwen2-math-72b-instruct 0.663
1816 qwen2-math-72b-instruct 0.663
1819 qwen2-math-72b-instruct 0.663
1299 qwen1.5-72b-chat 0.657
8705 qwen1.5-72b-chat 0.657
4258 qwen1.5-72b-chat 0.657
7627 qwen1.5-32b-chat 0.648
4148 qwen1.5-32b-chat 0.648
5067 google_gemma_3_4b_it 0.641
2748 mistralai_mixtral_8x22b_instruct_v0.1 0.632
9075 mistralai_mixtral_8x22b_instruct_v0.1 0.632
5052 mistralai_mixtral_8x22b_instruct_v0.1 0.632
1258 mistralai_mixtral_8x22b_instruct_v0.1 0.632
9523 qwen2-math-7b-instruct 0.620
5451 qwen2-math-7b-instruct 0.620
9363 qwen2-math-7b-instruct 0.620
2953 qwen2-math-7b-instruct 0.620
954 llama-3.1-8B-instruct 0.606
5675 mistralai_ministral_8b_instruct_2410 0.597
1532 qwen2-7b-instruct 0.592
8180 qwen2-7b-instruct 0.592
2946 qwen2-7b-instruct 0.592
8211 qwen2-7b-instruct 0.592
9043 qwen3-1.7b 0.586
9068 qwen2-math-1.5b-instruct 0.568
4042 qwen2-math-1.5b-instruct 0.568
104 qwen1.5-14b-chat 0.536
1393 qwen1.5-14b-chat 0.536
6459 qwen1.5-14b-chat 0.536
5322 qwen1.5-14b-chat 0.536
5660 qwen1.5-14b-chat 0.536
9932 qwen2.5-coder-7b-instruct 0.519
6898 qwen2.5-coder-7b-instruct 0.519
5195 deepseek_r1_distill_qwen_1.5b 0.496
5376 mistralai_mathstral_7b_v0.1 0.475
5771 mistralai_mathstral_7b_v0.1 0.475
5221 mistralai_mathstral_7b_v0.1 0.475
1364 mistralai_mathstral_7b_v0.1 0.475
1507 mistralai_mathstral_7b_v0.1 0.475
7332 qwen1.5-7b-chat 0.402
4164 qwen1.5-7b-chat 0.402
4531 google_codegemma_1.1_7b_it 0.335
9810 google_gemma_3_1b_it 0.289
3230 google_gemma_3_1b_it 0.289
7236 qwen3-0.6b 0.280
4436 qwen3-0.6b 0.280
4617 qwen3-0.6b 0.280
1056 mistralai_mistral_7b_instruct_v0.2 0.279
9608 llama-3.2-1B-instruct 0.190
5812 llama-3.2-1B-instruct 0.190
7347 google_gemma_7b_it 0.185
4311 mistralai_mistral_7b_instruct_v0.1 0.160
1191 qwen2-1.5b-instruct 0.152
8394 qwen2-1.5b-instruct 0.152
6447 qwen1.5-1.8b-chat 0.116
5297 qwen2.5-coder-0.5b-instruct 0.071
1168 qwen2.5-coder-0.5b-instruct 0.071
3193 qwen2.5-coder-0.5b-instruct 0.071
3464 qwen2.5-coder-0.5b-instruct 0.071
936 qwen2.5-coder-0.5b-instruct 0.071
2891 qwen2-0.5b-instruct 0.066
4879 qwen2-0.5b-instruct 0.066
6905 google_gemma_2b_it 0.062
2867 google_gemma_2b_it 0.062
3904 google_gemma_2b_it 0.062
411 google_gemma_2b_it 0.062
7299 google_gemma_2b_it 0.062
9719 qwen1.5-0.5b-chat 0.054
8092 qwen1.5-0.5b-chat 0.054
3220 qwen1.5-0.5b-chat 0.054
2890 qwen1.5-0.5b-chat 0.054
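A companion sketch under the same hypothetical matrix as above. The min_pass1_of_model column repeats one value per model, so the sketch reports each solver's overall pass@1 as a stand-in; this reading of the column is an assumption, not something the report states.

```python
import pandas as pd

results = pd.read_csv("per_example_pass1.csv", index_col=0)  # hypothetical file, rows = models

solved = results > 0                  # True where the model solved the example at least once
overall_pass1 = results.mean(axis=1)  # each model's overall pass@1 (assumed meaning of the column)

rows = []
for ex in results.columns[solved.sum(axis=0).values == 1]:
    model = solved[ex].idxmax()       # the unique model with a True entry in this column
    rows.append({"example": ex, "model": model, "min_pass1_of_model": overall_pass1[model]})

table = pd.DataFrame(rows).sort_values("min_pass1_of_model", ascending=False)
print(table.to_string(index=False))
```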

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e. better models tend to do worse on them). For each example, pass1_of_ex is its pass@1 and tau is that correlation; a sketch of how such a statistic can be computed follows the table.

example_link pass1_of_ex tau
8015 0.083 -0.459
2935 0.046 -0.431
10047 0.038 -0.413
9991 0.074 -0.411
717 0.199 -0.404
7535 0.080 -0.393
9252 0.072 -0.389
2039 0.045 -0.387
7068 0.252 -0.380
5663 0.026 -0.378
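The tau column suggests a rank correlation (Kendall's tau); the exact statistic is not stated, so the sketch below is one plausible reading, again using the hypothetical matrix from above and each model's overall pass@1 as the measure of overall quality.

```python
import pandas as pd
from scipy.stats import kendalltau

results = pd.read_csv("per_example_pass1.csv", index_col=0)  # hypothetical file, rows = models
overall = results.mean(axis=1)  # proxy for each model's overall evaluation score

records = []
for ex in results.columns:
    # Rank correlation between overall model quality and per-model results on this example.
    tau, _ = kendalltau(overall, results[ex])
    records.append({"example": ex, "pass1_of_ex": results[ex].mean(), "tau": tau})

suspect = pd.DataFrame(records).sort_values("tau").head(10)
print(suspect.to_string(index=False))
```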

Histogram of accuracies

Histogram of problems, binned by each problem's accuracy.

Histogram of difficulties

Histogram of problems, binned by the minimum win rate required to solve each problem (the win rate of the weakest model that still solves it). A sketch of both plots follows.
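A sketch of both histograms under the same assumptions as above; "win rate" is approximated here by each model's overall pass@1, which may differ from whatever win-rate metric the report actually uses.

```python
import pandas as pd
import matplotlib.pyplot as plt

results = pd.read_csv("per_example_pass1.csv", index_col=0)  # hypothetical file, rows = models

accuracy = results.mean(axis=0)  # per-problem accuracy, averaged over models
overall = results.mean(axis=1)   # per-model overall score (stand-in for "win rate")

def min_solver_score(col):
    """Lowest overall score among models that solve this problem (NaN if nobody solves it)."""
    solvers = col[col > 0].index
    return overall[solvers].min() if len(solvers) else float("nan")

difficulty = results.apply(min_solver_score, axis=0)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(accuracy, bins=20)
ax1.set(title="Accuracies", xlabel="accuracy on problem", ylabel="# problems")
ax2.hist(difficulty.dropna(), bins=20)
ax2.set(title="Difficulties", xlabel="min win rate among solvers", ylabel="# problems")
fig.tight_layout()
plt.show()
```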