math_cot: by examples


Not solved by any model

97 examples were not solved by any model. Provided these are sound problems, solving some of them is a good signal that your model is indeed better than the leading models.
1038, 1099, 1122, 1171, 1185, 1209, 1220, 1256, 1259, 1261, 1268, 1282, 1297, 1304, 1308, 1318, 1336, 1369, 1839, 1963, 1993, 2052, 2058, 2072, 2093, 2178, 218, 2194, 2251, 2279, 2296, 2303, 2305, 2320, 2351, 2352, 2371, 2377, 2395, 2398, 2426, 2436, 2678, 3124, 3133, 3164, 3190, 3198, 3295, 3340, 3363, 3398, 3399, 3409, 3412, 342, 3458, 3459, 3489, 3535, 3538, 3545, 3550, 3644, 3648, 368, 3683, 3687, 3711, 3713, 3736, 3784, 3793, 395, 4089, 410, 4193, 432, 4404, 4674, 4730, 4917, 509, 51, 561, 604, 625, 654, 658, 676, 710, 722, 744, 810, 825, 828, 919
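
A list like the one above can be derived from per-model results. The following is a minimal sketch, not the page's actual pipeline: the `solved_by_model` mapping, the model names, and the problem ids are hypothetical, and "solved" is assumed to mean the model got the example right at least once.

```python
# Hypothetical data: model name -> set of problem ids that model solved.
solved_by_model = {
    "model_a": {1, 2, 3},
    "model_b": {2, 3, 4},
    "model_c": {3},
}
all_problems = set(range(1, 7))  # hypothetical problem ids 1..6


def unsolved_problems(all_problems, solved_by_model):
    """Return the sorted problem ids that no model solved."""
    solved_by_any = set().union(*solved_by_model.values())
    return sorted(all_problems - solved_by_any)


print(unsolved_problems(all_problems, solved_by_model))  # [5, 6]
```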

Problems solved by 1 model only

example_link model min_pass1_of_model
1073 google_gemma_3_27b_it 0.865
1599 google_gemma_3_27b_it 0.865
2157 google_gemma_3_27b_it 0.865
2189 google_gemma_3_27b_it 0.865
968 google_gemma_3_27b_it 0.865
673 google_gemma_3_27b_it 0.865
2239 google_gemma_3_27b_it 0.865
3783 qwen3-14b 0.824
384 qwen3-14b 0.824
2408 google_gemma_3_12b_it 0.802
1953 google_gemma_3_12b_it 0.802
1989 google_gemma_3_12b_it 0.802
3339 qwen3-4b 0.781
3335 qwen3-8b 0.778
835 qwen3-8b 0.778
477 qwen3-32b 0.752
806 qwen3-32b 0.752
1944 google_gemma_3_4b_it 0.715
505 deepseek_r1_distill_llama_70b 0.709
3471 deepseek_r1_distill_llama_70b 0.709
555 deepseek_r1_distill_qwen_7b 0.684
2156 deepseek_r1_distill_qwen_7b 0.684
1052 deepseek_r1_distill_qwen_7b 0.684
577 deepseek_r1_distill_llama_8b 0.656
3746 deepseek_r1_distill_llama_8b 0.656
3761 deepseek_r1_distill_llama_8b 0.656
844 deepseek_r1_distill_qwen_14b 0.599
3734 deepseek_r1_distill_qwen_14b 0.599
2403 deepseek_r1_distill_qwen_14b 0.599
907 deepseek_r1_distill_qwen_14b 0.599
581 deepseek_r1_distill_qwen_14b 0.599
824 deepseek_r1_distill_qwen_14b 0.599
1127 deepseek_r1_distill_qwen_32b 0.578
790 deepseek_r1_distill_qwen_32b 0.578
3733 google_gemma_2_9b_it 0.440
775 qwen2.5-coder-7b-instruct 0.353
4120 mistralai_mixtral_8x22b_instruct_v0.1 0.332
879 google_gemma_3_1b_it 0.321
3521 llama-3.2-3B-instruct 0.309
4867 qwen2.5-coder-3b-instruct 0.299
1925 mistralai_ministral_8b_instruct_2410 0.285
1924 deepseek_v2_lite_chat 0.220
1298 deepseek_v2_lite_chat 0.220
560 qwen2-1.5b-instruct 0.090
1352 qwen2-1.5b-instruct 0.090
3863 google_gemma_2b_it 0.060
146 qwen2-0.5b-instruct 0.039
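
The table above could be rebuilt roughly as follows. This is a sketch under assumptions: the third column is taken to be the overall pass@1 of the single model that solved the problem, and the input mappings and model names are hypothetical.

```python
from collections import defaultdict


def solved_by_one_model(solved_by_model, model_pass1):
    """Return (problem_id, model, model_pass1) rows for problems with exactly one solver."""
    solvers = defaultdict(list)
    for model, solved in solved_by_model.items():
        for pid in solved:
            solvers[pid].append(model)
    rows = [
        (pid, models[0], model_pass1[models[0]])
        for pid, models in solvers.items()
        if len(models) == 1
    ]
    # Strongest solver first, then by problem id for a stable order.
    return sorted(rows, key=lambda r: (-r[2], r[0]))


# Hypothetical inputs.
solved_by_model = {"model_a": {1, 2, 3}, "model_b": {2, 3, 4}, "model_c": {3}}
model_pass1 = {"model_a": 0.75, "model_b": 0.5, "model_c": 0.125}

for row in solved_by_one_model(solved_by_model, model_pass1):
    print(*row)
```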

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on them).

example_link pass1_of_ex tau
3186 0.061 -0.253
146 0.005 -0.187
2203 0.031 -0.171
3863 0.007 -0.161
560 0.007 -0.152
1352 0.007 -0.152
2126 0.050 -0.152
897 0.012 -0.151
1901 0.083 -0.148
630 0.068 -0.133
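
To illustrate how such a correlation might be computed, here is a minimal sketch. It assumes that `tau` is a Kendall rank correlation (tau-b, which corrects for ties) between each model's overall score and its score on the single problem; the page does not state the exact variant, and the numbers below are hypothetical.

```python
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b of two equal-length sequences (O(n^2) pair scan)."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: excluded from both denominator factors
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical scores for four models, strongest first.
overall_pass1 = [0.9, 0.7, 0.5, 0.3]
problem_pass1 = [0.0, 0.0, 0.1, 0.2]  # weaker models do better on this problem

tau = kendall_tau_b(overall_pass1, problem_pass1)
print(tau < 0)  # True: a negative tau flags the problem as suspect
```

A strongly negative value means the ranking of models on this problem runs against their overall ranking, which often points to a wrong reference answer or an ambiguous statement.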

Histogram of accuracies

Histogram of problems by the accuracy on each problem.

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.
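
One way to compute this difficulty measure and bin it is sketched below. It assumes that the "win rate" of a model is its overall pass@1 and that a problem's difficulty is the lowest such score among the models that solved it; all names and values are hypothetical.

```python
from collections import Counter


def problem_difficulty(solved_by_model, model_pass1):
    """Map each solved problem id to the lowest overall score among its solvers."""
    difficulty = {}
    for model, solved in solved_by_model.items():
        score = model_pass1[model]
        for pid in solved:
            difficulty[pid] = min(score, difficulty.get(pid, score))
    return difficulty


def histogram(values, n_bins=10):
    """Count values in [0, 1] per equal-width bin; 1.0 falls in the last bin."""
    counts = Counter(min(int(v * n_bins), n_bins - 1) for v in values)
    return [counts.get(b, 0) for b in range(n_bins)]


# Hypothetical inputs.
solved_by_model = {"model_a": {1, 2, 3}, "model_b": {2, 3, 4}, "model_c": {3}}
model_pass1 = {"model_a": 0.75, "model_b": 0.5, "model_c": 0.125}

difficulty = problem_difficulty(solved_by_model, model_pass1)
print(histogram(difficulty.values(), n_bins=4))  # [1, 0, 2, 1]
```

Problems that no model solved have no entry in `difficulty` and would need a separate bucket.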