gsm8k_plus_cot: by examples

Not solved by any model

There are 156 examples that no model solves. If these are well-posed problems, solving some of them is a good signal that your model genuinely outperforms the leading models. A sketch of how this set can be computed follows the ID list below.
10155, 10380, 10497, 1131, 1209, 1388, 1416, 1444, 1465, 1585, 1611, 164, 1641, 1658, 1675, 1932, 2044, 2121, 2122, 2344, 2504, 2507, 2515, 2564, 2692, 2729, 2732, 2769, 2778, 2866, 2889, 291, 2949, 3043, 3067, 3112, 3138, 3148, 3179, 3275, 3276, 3281, 3377, 3386, 3409, 3507, 3576, 3582, 3610, 3637, 3638, 380, 3818, 3953, 3954, 3961, 3973, 4033, 4211, 4222, 4275, 4348, 4401, 442, 4432, 4434, 4443, 4548, 4587, 4720, 4722, 4725, 4884, 4888, 4890, 4892, 4923, 4963, 499, 501, 5120, 5132, 5198, 5232, 5258, 5281, 5338, 5403, 5657, 5788, 5796, 6035, 6036, 6241, 6388, 6401, 6467, 6499, 6537, 6590, 6640, 6779, 6809, 682, 6844, 6865, 6897, 690, 696, 699, 7012, 7120, 7148, 7232, 7233, 7337, 7356, 7440, 7552, 7564, 7611, 7619, 7705, 786, 7995, 8008, 8009, 8011, 8012, 8027, 8040, 8128, 8131, 8147, 8336, 8337, 8386, 8387, 8389, 843, 8531, 8562, 8706, 8708, 8843, 9096, 9276, 9311, 9329, 9404, 9522, 958, 9627, 9635, 980, 9873
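The report does not show how this set is derived; the following is a minimal sketch, assuming the per-example pass@1 scores are available as a model × example matrix (the file name and layout are hypothetical, not part of this report).

```python
import pandas as pd

# Hypothetical layout: rows = models, columns = example IDs,
# values = that model's pass@1 on that example.
results = pd.read_csv("per_example_pass1.csv", index_col=0)

# An example counts as "not solved by any model" when every model's pass@1 on it is 0.
never_solved = (results == 0).all(axis=0)
unsolved = never_solved[never_solved].index.tolist()

print(f"{len(unsolved)} examples not solved by any model")
print(", ".join(sorted(map(str, unsolved))))
```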

Problems solved by only one model

These examples were solved by exactly one model. Each row lists the example, the model that solved it, and that model's pass@1 score; a sketch of this computation follows the table.

example_link model min_pass1_of_model
2371 qwen3-32b 0.814
5405 qwen3-14b 0.811
6312 llama-3.1-70B-instruct 0.773
8014 llama-3.1-70B-instruct 0.773
4497 llama-3.1-70B-instruct 0.773
2138 google_gemma_3_27b_it 0.751
3958 google_gemma_3_27b_it 0.751
2950 qwen2-72b-instruct 0.744
467 qwen2-72b-instruct 0.744
9868 qwen2-72b-instruct 0.744
3427 qwen2-72b-instruct 0.744
7482 qwen3-4b 0.737
3243 deepseek_r1_distill_qwen_14b 0.710
5512 deepseek_r1_distill_qwen_14b 0.710
450 google_gemma_2_9b_it 0.701
9539 deepseek_r1_distill_qwen_7b 0.686
1972 deepseek_r1_distill_qwen_7b 0.686
4723 qwen2.5-coder-14b-instruct 0.668
6262 qwen2.5-coder-14b-instruct 0.668
956 qwen2-math-72b-instruct 0.663
2139 qwen2-math-72b-instruct 0.663
4726 qwen2-math-72b-instruct 0.663
1816 qwen2-math-72b-instruct 0.663
1819 qwen2-math-72b-instruct 0.663
1299 qwen1.5-72b-chat 0.657
8705 qwen1.5-72b-chat 0.657
4258 qwen1.5-72b-chat 0.657
7627 qwen1.5-32b-chat 0.648
4148 qwen1.5-32b-chat 0.648
5067 google_gemma_3_4b_it 0.641
2748 mistralai_mixtral_8x22b_instruct_v0.1 0.632
9075 mistralai_mixtral_8x22b_instruct_v0.1 0.632
5052 mistralai_mixtral_8x22b_instruct_v0.1 0.632
1258 mistralai_mixtral_8x22b_instruct_v0.1 0.632
9523 qwen2-math-7b-instruct 0.620
5451 qwen2-math-7b-instruct 0.620
9363 qwen2-math-7b-instruct 0.620
2953 qwen2-math-7b-instruct 0.620
954 llama-3.1-8B-instruct 0.606
5675 mistralai_ministral_8b_instruct_2410 0.597
1532 qwen2-7b-instruct 0.592
8180 qwen2-7b-instruct 0.592
2946 qwen2-7b-instruct 0.592
8211 qwen2-7b-instruct 0.592
9043 qwen3-1.7b 0.586
9068 qwen2-math-1.5b-instruct 0.568
4042 qwen2-math-1.5b-instruct 0.568
104 qwen1.5-14b-chat 0.536
1393 qwen1.5-14b-chat 0.536
6459 qwen1.5-14b-chat 0.536
5322 qwen1.5-14b-chat 0.536
5660 qwen1.5-14b-chat 0.536
9932 qwen2.5-coder-7b-instruct 0.519
6898 qwen2.5-coder-7b-instruct 0.519
5195 deepseek_r1_distill_qwen_1.5b 0.496
5376 mistralai_mathstral_7b_v0.1 0.475
5771 mistralai_mathstral_7b_v0.1 0.475
5221 mistralai_mathstral_7b_v0.1 0.475
1364 mistralai_mathstral_7b_v0.1 0.475
1507 mistralai_mathstral_7b_v0.1 0.475
7332 qwen1.5-7b-chat 0.402
4164 qwen1.5-7b-chat 0.402
4531 google_codegemma_1.1_7b_it 0.335
9810 google_gemma_3_1b_it 0.289
3230 google_gemma_3_1b_it 0.289
7236 qwen3-0.6b 0.280
4436 qwen3-0.6b 0.280
4617 qwen3-0.6b 0.280
1056 mistralai_mistral_7b_instruct_v0.2 0.279
9608 llama-3.2-1B-instruct 0.190
5812 llama-3.2-1B-instruct 0.190
7347 google_gemma_7b_it 0.185
4311 mistralai_mistral_7b_instruct_v0.1 0.160
1191 qwen2-1.5b-instruct 0.152
8394 qwen2-1.5b-instruct 0.152
6447 qwen1.5-1.8b-chat 0.116
5297 qwen2.5-coder-0.5b-instruct 0.071
1168 qwen2.5-coder-0.5b-instruct 0.071
3193 qwen2.5-coder-0.5b-instruct 0.071
3464 qwen2.5-coder-0.5b-instruct 0.071
936 qwen2.5-coder-0.5b-instruct 0.071
2891 qwen2-0.5b-instruct 0.066
4879 qwen2-0.5b-instruct 0.066
6905 google_gemma_2b_it 0.062
2867 google_gemma_2b_it 0.062
3904 google_gemma_2b_it 0.062
411 google_gemma_2b_it 0.062
7299 google_gemma_2b_it 0.062
9719 qwen1.5-0.5b-chat 0.054
8092 qwen1.5-0.5b-chat 0.054
3220 qwen1.5-0.5b-chat 0.054
2890 qwen1.5-0.5b-chat 0.054
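A companion sketch under the same hypothetical matrix as above. The min_pass1_of_model column repeats one value per model, so the sketch reports each solver's overall pass@1 as a stand-in; this reading of the column is an assumption, not something the report states.

```python
import pandas as pd

results = pd.read_csv("per_example_pass1.csv", index_col=0)  # hypothetical file, rows = models

solved = results > 0                  # True where the model solved the example at least once
overall_pass1 = results.mean(axis=1)  # each model's overall pass@1 (assumed meaning of the column)

rows = []
for ex in results.columns[solved.sum(axis=0).values == 1]:
    model = solved[ex].idxmax()       # the unique model with a True entry in this column
    rows.append({"example": ex, "model": model, "min_pass1_of_model": overall_pass1[model]})

table = pd.DataFrame(rows).sort_values("min_pass1_of_model", ascending=False)
print(table.to_string(index=False))
```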

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e. better models tend to do worse on them). For each example, pass1_of_ex is its pass@1 and tau is that correlation; a sketch of how such a statistic can be computed follows the table.

example_link pass1_of_ex tau
8015 0.083 -0.459
2935 0.046 -0.431
10047 0.038 -0.413
9991 0.074 -0.411
717 0.199 -0.404
7535 0.080 -0.393
9252 0.072 -0.389
2039 0.045 -0.387
7068 0.252 -0.380
5663 0.026 -0.378
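The tau column suggests a rank correlation (Kendall's tau); the exact statistic is not stated, so the sketch below is one plausible reading, again using the hypothetical matrix from above and each model's overall pass@1 as the measure of overall quality.

```python
import pandas as pd
from scipy.stats import kendalltau

results = pd.read_csv("per_example_pass1.csv", index_col=0)  # hypothetical file, rows = models
overall = results.mean(axis=1)  # proxy for each model's overall evaluation score

records = []
for ex in results.columns:
    # Rank correlation between overall model quality and per-model results on this example.
    tau, _ = kendalltau(overall, results[ex])
    records.append({"example": ex, "pass1_of_ex": results[ex].mean(), "tau": tau})

suspect = pd.DataFrame(records).sort_values("tau").head(10)
print(suspect.to_string(index=False))
```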

Histogram of accuracies

Histogram of problems, binned by each problem's accuracy.

Histogram of difficulties

Histogram of problems, binned by the minimum win rate required to solve each problem (the win rate of the weakest model that still solves it). A sketch of both plots follows.
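A sketch of both histograms under the same assumptions as above; "win rate" is approximated here by each model's overall pass@1, which may differ from whatever win-rate metric the report actually uses.

```python
import pandas as pd
import matplotlib.pyplot as plt

results = pd.read_csv("per_example_pass1.csv", index_col=0)  # hypothetical file, rows = models

accuracy = results.mean(axis=0)  # per-problem accuracy, averaged over models
overall = results.mean(axis=1)   # per-model overall score (stand-in for "win rate")

def min_solver_score(col):
    """Lowest overall score among models that solve this problem (NaN if nobody solves it)."""
    solvers = col[col > 0].index
    return overall[solvers].min() if len(solvers) else float("nan")

difficulty = results.apply(min_solver_score, axis=0)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(accuracy, bins=20)
ax1.set(title="Accuracies", xlabel="accuracy on problem", ylabel="# problems")
ax2.hist(difficulty.dropna(), bins=20)
ax2.set(title="Difficulties", xlabel="min win rate among solvers", ylabel="# problems")
fig.tight_layout()
plt.show()
```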