math_cot: by examples


Not solved by any model

There are 54 examples that no model solved. If these problems are sound, solving some of them is a good signal that your model is better than the current leading models.
51, 342, 410, 625, 654, 658, 676, 775, 810, 825, 1099, 1122, 1171, 1256, 1261, 1282, 1297, 1304, 1308, 1336, 1925, 1993, 2052, 2058, 2072, 2093, 2194, 2251, 2279, 2351, 2352, 2426, 2436, 3133, 3164, 3198, 3295, 3335, 3409, 3412, 3489, 3535, 3538, 3545, 3683, 3687, 3711, 3736, 3746, 3784, 4089, 4193, 4674, 4730
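A list like this could be derived from per-model results roughly as follows. This is a minimal sketch, not the page's actual pipeline: the `pass1` mapping, the model names, the scores, and the `threshold` parameter are all hypothetical, and "solved" is assumed to mean a pass@1 above the threshold.

```python
# Hypothetical data: pass@1 per model per example id (not the real results).
pass1 = {
    "model_a": {51: 0.0, 342: 0.0, 919: 0.4},
    "model_b": {51: 0.0, 342: 0.0, 919: 0.0},
}

def unsolved_by_all(pass1, threshold=0.0):
    """Return sorted example ids whose pass@1 is <= threshold for every model."""
    example_ids = set()
    for scores in pass1.values():
        example_ids.update(scores)
    # A missing score is treated as a failure (assumption).
    return sorted(
        ex for ex in example_ids
        if all(scores.get(ex, 0.0) <= threshold for scores in pass1.values())
    )

unsolved = unsolved_by_all(pass1)
print(unsolved)
```

Sorting the ids numerically rather than as strings keeps short ids such as 51 at the front of the list.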

Problems solved by 1 model only

example_link	model	min_pass1_of_model
2408	google_gemma_3_27b_it	0.861
4404	qwen3-32b	0.769
3615	deepseek_r1_distill_llama_8b	0.691
919	qwen3-1.7b	0.639
1369	deepseek_r1_distill_qwen_1.5b	0.626
3398	deepseek_r1_distill_qwen_14b	0.613
1318	deepseek_r1_distill_qwen_14b	0.613
907	deepseek_r1_distill_qwen_14b	0.613
1127	deepseek_r1_distill_qwen_32b	0.580
828	qwen2.5-coder-7b-instruct	0.459
4120	qwen1.5-14b-chat	0.315
3521	google_gemma_7b_it	0.119
1352	mistralai_mistral_7b_instruct_v0.1	0.071
3863	google_gemma_2b_it	0.063
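The same kind of per-model results can be filtered for problems with exactly one solver. The sketch below assumes a hypothetical `pass1` mapping and treats any pass@1 above `threshold` as solved; neither detail is confirmed by the page.

```python
# Hypothetical data: pass@1 per model per example id (not the real results).
pass1 = {
    "model_a": {1: 0.5, 2: 0.0, 3: 0.2},
    "model_b": {1: 0.3, 2: 0.0, 3: 0.0},
}

def solved_by_one(pass1, threshold=0.0):
    """Map each example id solved by exactly one model to that model's name."""
    solvers = {}
    for model, scores in pass1.items():
        for ex, p in scores.items():
            if p > threshold:
                solvers.setdefault(ex, []).append(model)
    return {ex: ms[0] for ex, ms in sorted(solvers.items()) if len(ms) == 1}

unique_solvers = solved_by_one(pass1)
print(unique_solvers)
```

Example 1 is excluded because both models solve it, and example 2 because neither does.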

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (i.e., better models tend to do worse on them).

example_link	pass1_of_ex	tau
163	0.039	-0.223
2751	0.047	-0.204
432	0.018	-0.191
3863	0.005	-0.185
3476	0.012	-0.184
4917	0.005	-0.179
1352	0.002	-0.177
630	0.060	-0.174
1206	0.176	-0.161
3186	0.071	-0.151
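One way to compute a per-problem rank correlation is sketched below, assuming `tau` is a Kendall-style coefficient between each model's overall score and its pass rate on the problem. The page does not say which variant it uses; this sketch implements the simple tau-a form without tie correction, and all scores are made up for illustration.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs, ties count as 0."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical: four models ordered by overall score, and their pass
# rates on one problem where stronger models do worse.
overall_score = [0.2, 0.4, 0.6, 0.8]
problem_pass = [0.3, 0.2, 0.1, 0.0]

tau = kendall_tau_a(overall_score, problem_pass)
print(tau)
```

A negative value means that models ranked higher overall tend to rank lower on this problem, which is the pattern that flags a problem as suspect.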

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
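The binning behind such a histogram can be sketched as follows. The bin count is an assumption, and the sample values are only the ten `pass1_of_ex` figures from the suspect-problems table, not the full dataset.

```python
def histogram(values, n_bins=10):
    """Count values in n_bins equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for v in values:
        # Clamp so that a value of exactly 1.0 lands in the last bin.
        idx = min(int(v * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts

# pass1_of_ex values taken from the suspect-problems table above.
accuracies = [0.039, 0.047, 0.018, 0.005, 0.012,
              0.005, 0.002, 0.060, 0.176, 0.071]

counts = histogram(accuracies)
for i, c in enumerate(counts):
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f}) {'#' * c}")
```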

Histogram of difficulties

Histogram of problems by the minimum win rate to solve each problem.
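The page does not define "minimum win rate to solve" precisely. One plausible reading, sketched here, is the lowest win rate among the models that solve a given problem. The `win_rate` and `solved` mappings and the model names are hypothetical.

```python
# Hypothetical inputs: each model's overall win rate, and the set of
# example ids each model solves.
win_rate = {"weak_model": 0.2, "strong_model": 0.9}
solved = {"weak_model": {1}, "strong_model": {1, 2}}

def min_win_rate_to_solve(solved, win_rate):
    """For each solved example, the lowest win rate among its solvers."""
    difficulty = {}
    for model, examples in solved.items():
        for ex in examples:
            current = difficulty.get(ex, float("inf"))
            difficulty[ex] = min(current, win_rate[model])
    return difficulty

difficulty = min_win_rate_to_solve(solved, win_rate)
print(difficulty)
```

Under this reading, a problem that only high-win-rate models solve gets a high difficulty value, and problems that no model solves have no value at all.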