math_cot: by examples



Not solved by any model

There are 54 examples that no model solved. Provided these problems are well-posed, solving some of them is a good signal that your model is better than the leading models.
1099, 1122, 1171, 1256, 1261, 1282, 1297, 1304, 1308, 1336, 1925, 1993, 2052, 2058, 2072, 2093, 2194, 2251, 2279, 2351, 2352, 2426, 2436, 3133, 3164, 3198, 3295, 3335, 3409, 3412, 342, 3489, 3535, 3538, 3545, 3683, 3687, 3711, 3736, 3746, 3784, 4089, 410, 4193, 4674, 4730, 51, 625, 654, 658, 676, 775, 810, 825
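A list like the one above could be derived from per-model results roughly as follows. This is a minimal sketch, not the report's own code: the `pass1` layout (model name mapped to problem id mapped to pass@1), the function name `unsolved_by_all`, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, List


def unsolved_by_all(pass1: Dict[str, Dict[int, float]]) -> List[int]:
    """Return the ids of problems whose pass@1 is zero for every model."""
    # Collect every problem id that appears for at least one model.
    problems = set()
    for scores in pass1.values():
        problems.update(scores)
    # A missing entry is treated as unsolved (assumption).
    return sorted(
        pid
        for pid in problems
        if all(scores.get(pid, 0.0) == 0.0 for scores in pass1.values())
    )


# Toy data, not taken from the report.
toy = {
    "model_a": {1: 1.0, 2: 0.0, 3: 0.0},
    "model_b": {1: 0.5, 2: 0.0, 3: 0.25},
}
print(unsolved_by_all(toy))
```

The ids are sorted numerically here; the list in the report appears to be sorted as strings instead.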

Problems solved by 1 model only

example_link model min_pass1_of_model
2408 google_gemma_3_27b_it 0.861
4404 qwen3-32b 0.769
3615 deepseek_r1_distill_llama_8b 0.691
919 qwen3-1.7b 0.639
1369 deepseek_r1_distill_qwen_1.5b 0.626
3398 deepseek_r1_distill_qwen_14b 0.613
1318 deepseek_r1_distill_qwen_14b 0.613
907 deepseek_r1_distill_qwen_14b 0.613
1127 deepseek_r1_distill_qwen_32b 0.580
828 qwen2.5-coder-7b-instruct 0.459
4120 qwen1.5-14b-chat 0.315
3521 google_gemma_7b_it 0.119
1352 mistralai_mistral_7b_instruct_v0.1 0.071
3863 google_gemma_2b_it 0.063
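The table above can be reproduced in spirit with a small filter. The sketch below assumes that `min_pass1_of_model` is the overall pass@1 of the weakest model that solves the problem, which for these rows is simply the sole solver; that reading, the `pass1` layout, and the helper name `solved_by_one` are assumptions rather than something the page states.

```python
from typing import Dict, List, Tuple


def solved_by_one(pass1: Dict[str, Dict[int, float]]) -> List[Tuple[int, str, float]]:
    """List (problem_id, model, overall_pass1) for problems only one model solves."""
    # Overall pass@1 of a model: mean over its problems (assumed definition).
    overall = {m: sum(s.values()) / max(len(s), 1) for m, s in pass1.items()}
    problems = {pid for s in pass1.values() for pid in s}
    rows = []
    for pid in problems:
        solvers = [m for m, s in pass1.items() if s.get(pid, 0.0) > 0.0]
        if len(solvers) == 1:
            rows.append((pid, solvers[0], overall[solvers[0]]))
    # Strongest sole solver first, matching the ordering of the table.
    return sorted(rows, key=lambda r: (-r[2], r[0]))


# Toy data, not taken from the report.
toy = {
    "model_a": {1: 1.0, 2: 0.0, 3: 0.5},
    "model_b": {1: 0.0, 2: 0.0, 3: 0.5},
}
print(solved_by_one(toy))
```

Rows near the bottom of such a table are the interesting ones: a problem solved only by a weak model may point to a lucky guess or a faulty reference answer.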

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (the tau column). A negative tau means that better models tend to do worse on the problem.

example_link pass1_of_ex tau
163 0.039 -0.223
2751 0.047 -0.204
432 0.018 -0.191
3863 0.005 -0.185
3476 0.012 -0.184
4917 0.005 -0.179
1352 0.002 -0.177
630 0.060 -0.174
1206 0.176 -0.161
3186 0.071 -0.151
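The page does not say which statistic `tau` is. Assuming it is Kendall's tau between each model's overall score and its score on a single problem, a pure-Python version (tau-b, which tolerates ties) might look like the sketch below. The function name and the toy numbers are illustrative only.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b between two equally long sequences."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both factors
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    # Return 0.0 when one variable is constant and tau is undefined.
    return (concordant - discordant) / denom if denom else 0.0


# Toy data: one entry per model, not taken from the report.
overall_pass1 = [0.2, 0.5, 0.8]   # overall score of each model
problem_pass1 = [1.0, 0.5, 0.0]   # score of each model on one problem
print(kendall_tau_b(overall_pass1, problem_pass1))
```

In this toy case the weakest model does best on the problem, so every pair of models is discordant and tau is at its minimum.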

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
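The counts behind such a histogram can be computed with equal-width bins over [0, 1]. The following is a sketch under that assumption; the bin count and the helper name `accuracy_histogram` are not taken from the page.

```python
from typing import List, Sequence


def accuracy_histogram(values: Sequence[float], bins: int = 10) -> List[int]:
    """Count values in [0, 1] per equal-width bin."""
    counts = [0] * bins
    for v in values:
        # Clamp the index so that v == 1.0 falls into the last bin.
        idx = min(int(v * bins), bins - 1)
        counts[idx] += 1
    return counts


# Toy per-problem accuracies, not taken from the report.
toy_accuracies = [0.0, 0.1, 0.5, 0.99, 1.0]
print(accuracy_histogram(toy_accuracies, bins=4))
```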

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.