gsm8k_plus_cot: by examples

Not solved by any model

There are 82 examples that no model solves. If these are well-posed problems, solving some of them is a good signal that your model really does improve on the leading models. The example ids are listed below, followed by a sketch of how this set can be derived from per-example results.
10380, 1056, 1416, 1444, 1585, 1611, 164, 1819, 1932, 2122, 2344, 2507, 2564, 2692, 2778, 2866, 3067, 3112, 3148, 3275, 3377, 3409, 3464, 3610, 3637, 3638, 3904, 3954, 3961, 3973, 4033, 4222, 4401, 442, 4432, 450, 4587, 4722, 4725, 4726, 4884, 4892, 4963, 502, 5120, 5232, 5297, 5403, 5788, 5913, 6312, 6467, 6537, 6779, 6809, 6844, 690, 696, 7012, 7233, 7356, 7552, 7705, 7995, 8011, 8013, 8040, 8128, 8147, 8336, 8337, 8341, 8386, 8387, 843, 8562, 8706, 9096, 936, 9404, 980, 9873
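A minimal sketch of how this set could be reproduced, assuming the per-example results are available as a pandas DataFrame with hypothetical columns example, model, and pass1 (the model's pass@1 on that example). These names are illustrative and not the actual pipeline's schema.

```python
import pandas as pd


def unsolved_examples(results: pd.DataFrame) -> list:
    """Return example ids whose pass@1 is 0 for every model.

    Assumes hypothetical columns: "example", "model", "pass1".
    """
    max_pass1 = results.groupby("example")["pass1"].max()
    return sorted(max_pass1[max_pass1 == 0].index.tolist())


# Toy usage: example 1 is unsolved by both models, example 2 is not.
toy = pd.DataFrame({
    "example": [1, 1, 2, 2],
    "model":   ["a", "b", "a", "b"],
    "pass1":   [0.0, 0.0, 0.5, 0.0],
})
print(unsolved_examples(toy))  # -> [1]
```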

Problems solved by only one model

example_link model min_pass1_of_model
7887 deepseek_r1_distill_qwen_32b 0.751
3507 deepseek_r1_distill_qwen_14b 0.745
1972 deepseek_r1_distill_qwen_7b 0.733
6262 qwen2.5-coder-14b-instruct 0.709
10288 mistralai_mixtral_8x22b_instruct_v0.1 0.696
5052 mistralai_mixtral_8x22b_instruct_v0.1 0.696
5512 qwen1.5-32b-chat 0.671
2121 deepseek_r1_distill_llama_8b 0.668
2748 qwen2-math-7b-instruct 0.655
8531 qwen2-math-7b-instruct 0.655
3281 mistralai_ministral_8b_instruct_2410 0.646
9329 google_gemma_3_4b_it 0.643
4720 qwen2.5-coder-7b-instruct 0.614
7619 qwen2.5-coder-7b-instruct 0.614
1816 mistralai_mathstral_7b_v0.1 0.577
6865 qwen1.5-14b-chat 0.557
5059 qwen1.5-14b-chat 0.557
8027 qwen1.5-14b-chat 0.557
4148 qwen1.5-14b-chat 0.557
2504 deepseek_v2_lite_chat 0.499
9635 deepseek_v2_lite_chat 0.499
9276 mistralai_mixtral_8x7b_instruct_v0.1 0.492
6035 qwen1.5-7b-chat 0.430
7120 google_codegemma_1.1_7b_it 0.375
4890 google_codegemma_1.1_7b_it 0.375
8092 mistralai_mistral_7b_instruct_v0.3 0.363
1364 mistralai_mistral_7b_instruct_v0.3 0.363
8843 qwen2.5-coder-1.5b-instruct 0.349
6640 qwen2.5-coder-1.5b-instruct 0.349
4275 qwen3-0.6b 0.300
7236 qwen3-0.6b 0.300
6897 mistralai_mistral_7b_instruct_v0.2 0.298
2515 mistralai_mistral_7b_instruct_v0.2 0.298
6401 mistralai_mistral_7b_instruct_v0.2 0.298
2376 google_gemma_3_1b_it 0.296
2343 qwen2-1.5b-instruct 0.258
8014 qwen2-1.5b-instruct 0.258
5928 qwen2-1.5b-instruct 0.258
8131 qwen2-1.5b-instruct 0.258
1168 llama-3.2-1B-instruct 0.243
10051 google_gemma_7b_it 0.191
5796 google_gemma_7b_it 0.191
1658 google_gemma_7b_it 0.191
9868 qwen1.5-1.8b-chat 0.151
1167 qwen1.5-1.8b-chat 0.151
2595 qwen2-0.5b-instruct 0.116
3847 qwen2-0.5b-instruct 0.116
8012 qwen2-0.5b-instruct 0.116
8008 qwen2-0.5b-instruct 0.116
6311 qwen2.5-coder-0.5b-instruct 0.090
1388 qwen2.5-coder-0.5b-instruct 0.090
3193 google_gemma_2b_it 0.063
2867 qwen1.5-0.5b-chat 0.043
1209 qwen1.5-0.5b-chat 0.043
6036 qwen1.5-0.5b-chat 0.043
7337 qwen1.5-0.5b-chat 0.043
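A similar sketch for this table, under the same hypothetical schema. Reading min_pass1_of_model as the solving model's overall pass@1 (its mean pass@1 over all examples) is an assumption, not something the table states.

```python
import pandas as pd


def solved_by_one_model(results: pd.DataFrame) -> pd.DataFrame:
    """Examples solved by exactly one model, with that model's overall pass@1."""
    solved = results[results["pass1"] > 0]
    counts = solved.groupby("example")["model"].nunique()
    unique_examples = counts[counts == 1].index
    # Overall pass@1 per model, taken here as the mean over all examples
    # (an assumption about the metric behind min_pass1_of_model).
    overall = results.groupby("model")["pass1"].mean()
    table = solved[solved["example"].isin(unique_examples)].copy()
    table["model_pass1"] = table["model"].map(overall)
    return (table[["example", "model", "model_pass1"]]
            .sort_values("model_pass1", ascending=False))
```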

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on them). A sketch of one way to compute this correlation follows the table.

example_link pass1_of_ex tau
717 0.203 -0.578
9252 0.066 -0.535
369 0.057 -0.519
3810 0.059 -0.478
7068 0.302 -0.467
9301 0.241 -0.459
5997 0.052 -0.453
4347 0.028 -0.443
4100 0.069 -0.429
7449 0.021 -0.425
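One way this could be computed, assuming tau is Kendall's tau between each model's overall pass@1 and its pass@1 on the given example (an assumption about the metric). A strongly negative tau means stronger models tend to fail that example. Column names follow the hypothetical schema used above.

```python
import pandas as pd
from scipy.stats import kendalltau


def suspect_problems(results: pd.DataFrame, k: int = 10) -> pd.DataFrame:
    """Return the k examples whose per-example performance correlates
    most negatively with overall model strength."""
    overall = results.groupby("model")["pass1"].mean()
    rows = []
    for example, group in results.groupby("example"):
        # Rank-correlate overall model strength with performance on this example.
        tau, _ = kendalltau(group["model"].map(overall), group["pass1"])
        rows.append({
            "example": example,
            "pass1_of_ex": group["pass1"].mean(),
            "tau": tau,
        })
    return pd.DataFrame(rows).sort_values("tau").head(k)
```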

Histogram of accuracies

Histogram of problems, binned by per-problem accuracy.
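A plotting sketch, assuming per-problem accuracy is the mean pass@1 over models (hypothetical schema as above).

```python
import matplotlib.pyplot as plt


def plot_accuracy_histogram(results):
    # Per-problem accuracy, assumed here to be the mean pass@1 over models.
    per_example_acc = results.groupby("example")["pass1"].mean()
    plt.hist(per_example_acc, bins=20, range=(0.0, 1.0))
    plt.xlabel("accuracy (mean pass@1 over models)")
    plt.ylabel("number of problems")
    plt.title("Histogram of accuracies")
    plt.show()
```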

Histogram of difficulties

Histogram of problems, binned by the minimum win rate required to solve each problem.
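A sketch of the difficulty measure, assuming "win rate" is a per-model overall win rate supplied separately and that a problem's difficulty is the lowest win rate among the models that solve it. Both readings are assumptions about how this report defines the quantity.

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_difficulty_histogram(results: pd.DataFrame, win_rate: dict):
    """Histogram of per-problem difficulty.

    win_rate is a hypothetical mapping from model name to its overall win rate.
    """
    solved = results[results["pass1"] > 0].copy()
    solved["win_rate"] = solved["model"].map(win_rate)
    # Difficulty: the weakest model (by win rate) that still solves the problem.
    min_win_rate = solved.groupby("example")["win_rate"].min()
    plt.hist(min_win_rate, bins=20, range=(0.0, 1.0))
    plt.xlabel("minimum win rate needed to solve")
    plt.ylabel("number of problems")
    plt.title("Histogram of difficulties")
    plt.show()
```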