math_cot: by examples


Not solved by any model

There are 97 examples that no model solves. Solving some of them is a good signal that your model really is better than the leading models, provided the problems themselves are sound.
1038, 1099, 1122, 1171, 1185, 1209, 1220, 1256, 1259, 1261, 1268, 1282, 1297, 1304, 1308, 1318, 1336, 1369, 1839, 1963, 1993, 2052, 2058, 2072, 2093, 2178, 218, 2194, 2251, 2279, 2296, 2303, 2305, 2320, 2351, 2352, 2371, 2377, 2395, 2398, 2426, 2436, 2678, 3124, 3133, 3164, 3190, 3198, 3295, 3340, 3363, 3398, 3399, 3409, 3412, 342, 3458, 3459, 3489, 3535, 3538, 3545, 3550, 3644, 3648, 368, 3683, 3687, 3711, 3713, 3736, 3784, 3793, 395, 4089, 410, 4193, 432, 4404, 4674, 4730, 4917, 509, 51, 561, 604, 625, 654, 658, 676, 710, 722, 744, 810, 825, 828, 919
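A list like this can be derived from per-model results. The following is a minimal sketch, not the code behind this page: it assumes a hypothetical `pass1` mapping of model name to per-example pass@1 scores, and it assumes that an example counts as "solved" when its score is above zero.

```python
def unsolved_examples(pass1):
    """Return the sorted ids of examples that no model solves.

    pass1 (hypothetical layout): {model_name: {example_id: pass@1 in [0, 1]}}.
    Assumption: an example is "solved" by a model if its pass@1 is > 0.
    """
    all_ids = set()
    for scores in pass1.values():
        all_ids.update(scores)
    return sorted(
        ex for ex in all_ids
        if all(scores.get(ex, 0.0) == 0.0 for scores in pass1.values())
    )


# Toy data for illustration only; these are not real results.
toy = {
    "model_a": {"a": 0.0, "b": 0.5},
    "model_b": {"a": 0.0, "b": 0.0, "c": 0.0},
}
print(unsolved_examples(toy))
```

An example missing from a model's results is treated as unsolved by that model, which may or may not match how the page handles missing data.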

Problems solved by 1 model only

example_link	model	min_pass1_of_model
1073	google_gemma_3_27b_it	0.865
1599	google_gemma_3_27b_it	0.865
2157	google_gemma_3_27b_it	0.865
2189	google_gemma_3_27b_it	0.865
968	google_gemma_3_27b_it	0.865
673	google_gemma_3_27b_it	0.865
2239	google_gemma_3_27b_it	0.865
3783	qwen3-14b	0.824
384	qwen3-14b	0.824
2408	google_gemma_3_12b_it	0.802
1953	google_gemma_3_12b_it	0.802
1989	google_gemma_3_12b_it	0.802
3339	qwen3-4b	0.781
3335	qwen3-8b	0.778
835	qwen3-8b	0.778
477	qwen3-32b	0.752
806	qwen3-32b	0.752
1944	google_gemma_3_4b_it	0.715
505	deepseek_r1_distill_llama_70b	0.709
3471	deepseek_r1_distill_llama_70b	0.709
555	deepseek_r1_distill_qwen_7b	0.684
2156	deepseek_r1_distill_qwen_7b	0.684
1052	deepseek_r1_distill_qwen_7b	0.684
577	deepseek_r1_distill_llama_8b	0.656
3746	deepseek_r1_distill_llama_8b	0.656
3761	deepseek_r1_distill_llama_8b	0.656
844	deepseek_r1_distill_qwen_14b	0.599
3734	deepseek_r1_distill_qwen_14b	0.599
2403	deepseek_r1_distill_qwen_14b	0.599
907	deepseek_r1_distill_qwen_14b	0.599
581	deepseek_r1_distill_qwen_14b	0.599
824	deepseek_r1_distill_qwen_14b	0.599
1127	deepseek_r1_distill_qwen_32b	0.578
790	deepseek_r1_distill_qwen_32b	0.578
3733	google_gemma_2_9b_it	0.440
775	qwen2.5-coder-7b-instruct	0.353
4120	mistralai_mixtral_8x22b_instruct_v0.1	0.332
879	google_gemma_3_1b_it	0.321
3521	llama-3.2-3B-instruct	0.309
4867	qwen2.5-coder-3b-instruct	0.299
1925	mistralai_ministral_8b_instruct_2410	0.285
1924	deepseek_v2_lite_chat	0.220
1298	deepseek_v2_lite_chat	0.220
560	qwen2-1.5b-instruct	0.090
1352	qwen2-1.5b-instruct	0.090
3863	google_gemma_2b_it	0.060
146	qwen2-0.5b-instruct	0.039
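The table above could be rebuilt along these lines. This is a sketch under stated assumptions: `pass1` and `overall` are hypothetical inputs, "solved" means pass@1 above zero, and `min_pass1_of_model` is read as the overall pass@1 of the single model that solves the example (the source does not define the column).

```python
def solved_by_one_model(pass1, overall):
    """List (example_id, model, overall score) for examples with exactly one solver.

    pass1 (hypothetical): {model_name: {example_id: pass@1}}
    overall (hypothetical): {model_name: overall pass@1 of that model}
    Assumption: a model "solves" an example if its pass@1 on it is > 0.
    """
    all_ids = set().union(*(scores.keys() for scores in pass1.values()))
    rows = []
    for ex in all_ids:
        solvers = [m for m, scores in pass1.items() if scores.get(ex, 0.0) > 0.0]
        if len(solvers) == 1:
            rows.append((ex, solvers[0], overall[solvers[0]]))
    # Strongest solver first, as in the table; ties broken by example id.
    rows.sort(key=lambda r: (-r[2], r[0]))
    return rows


# Toy data for illustration only; these are not real results.
toy_pass1 = {
    "strong": {"p1": 1.0, "p2": 0.0, "p3": 0.5},
    "weak": {"p1": 0.0, "p2": 0.25, "p3": 0.5},
}
toy_overall = {"strong": 0.8, "weak": 0.3}
for row in solved_by_one_model(toy_pass1, toy_overall):
    print(*row, sep="\t")
```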

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on them).

example_link	pass1_of_ex	tau
3186	0.061	-0.253
146	0.005	-0.187
2203	0.031	-0.171
3863	0.007	-0.161
560	0.007	-0.152
1352	0.007	-0.152
2126	0.050	-0.152
897	0.012	-0.151
1901	0.083	-0.148
630	0.068	-0.133
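One way to compute such a ranking is sketched below. It assumes the `tau` column is a Kendall rank correlation between each example's per-model score and the models' overall scores; the source does not say which variant is used, so the sketch uses the simple tau-a form, which does not correct for ties. The `pass1` and `overall` inputs are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs. Ties count as 0."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    score = 0
    for i, j in combinations(range(len(xs)), 2):
        prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if prod > 0:
            score += 1
        elif prod < 0:
            score -= 1
    n = len(xs)
    return score / (n * (n - 1) / 2)


def suspect_problems(pass1, overall, k=10):
    """Return the k examples whose per-model scores correlate least with overall scores.

    pass1 (hypothetical): {model_name: {example_id: pass@1}}
    overall (hypothetical): {model_name: overall score}
    """
    models = sorted(overall)
    model_scores = [overall[m] for m in models]
    ids = set().union(*(pass1.get(m, {}).keys() for m in models))
    taus = []
    for ex in ids:
        ex_scores = [pass1.get(m, {}).get(ex, 0.0) for m in models]
        taus.append((ex, kendall_tau_a(ex_scores, model_scores)))
    taus.sort(key=lambda r: (r[1], r[0]))  # most negative tau first
    return taus[:k]
```

With the many ties that a mostly unsolved problem produces, a tie-corrected variant such as tau-b would give different magnitudes, so the values from this sketch should not be expected to match the table.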

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
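The counts behind such a histogram can be sketched as follows, assuming that a problem's accuracy is its pass@1 averaged over all models (this matches the spirit of the `pass1_of_ex` column above, but the exact definition is an assumption). The `pass1` input and the bin count are hypothetical.

```python
def accuracy_histogram(pass1, bins=10):
    """Count problems per accuracy bin over [0, 1].

    pass1 (hypothetical): {model_name: {example_id: pass@1}}
    Assumption: per-problem accuracy = mean pass@1 over all models,
    with a missing score treated as 0.
    """
    if not pass1:
        return [0] * bins
    models = list(pass1)
    ids = set().union(*(scores.keys() for scores in pass1.values()))
    counts = [0] * bins
    for ex in ids:
        acc = sum(pass1[m].get(ex, 0.0) for m in models) / len(models)
        idx = min(int(acc * bins), bins - 1)  # accuracy 1.0 goes in the last bin
        counts[idx] += 1
    return counts
```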

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.
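A possible reading of this difficulty measure, sketched under assumptions: for each problem, take the lowest overall score among the models that solve it, so a problem that even a weak model solves is easy and one that only strong models solve is hard. The source says "win rate" here and `min_pass1_of_model` in the table above; the sketch treats both as the same per-model overall score, which is a guess. The `pass1` and `overall` inputs are hypothetical.

```python
def min_solver_score(pass1, overall):
    """Map each solved example to the lowest overall score among its solvers.

    pass1 (hypothetical): {model_name: {example_id: pass@1}}
    overall (hypothetical): {model_name: overall score of that model}
    Assumption: a model "solves" an example if its pass@1 on it is > 0.
    Examples that no model solves are left out of the result.
    """
    difficulty = {}
    for model, scores in pass1.items():
        for ex, p in scores.items():
            if p > 0.0:
                best = difficulty.get(ex, float("inf"))
                difficulty[ex] = min(best, overall[model])
    return difficulty
```

The resulting values can then be binned in the same way as any other per-problem score to produce the histogram.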