Full table: summary.csv
| benchmark_id | size | models | SE(A) | SE_x(A) | SE(A-B) | SE_x(A-B) | no_solve | tau- | details |
|---|---|---|---|---|---|---|---|---|---|
| aime2024_cot | 30 | 52 | 7.3 | 5.7 | 8.1 | 4.7 | 20 | 0 | models, examples, data, more |
| aime2025_cot | 30 | 52 | 7.2 | 5.9 | 6.7 | 3.4 | 27 | 3.3 | models, examples, data, more |
| ap_cot | 711 | 52 | 1.6 | 1.2 | 1.9 | 1.1 | 0 | 3.1 | models, examples, data, more |
| bbh_cot | 6511 | 45 | 0.56 | 0.41 | 0.68 | 0.4 | 1.2 | 2.1 | models, examples, data, more |
| cruxeval_input_cot | 800 | 52 | 1.6 | 1.2 | 1.9 | 1 | 0.62 | 1.5 | models, examples, data, more |
| cruxeval_output_cot | 800 | 52 | 1.6 | 1.2 | 1.8 | 0.93 | 2.5 | 1.1 | models, examples, data, more |
| ds1000 | 1000 | 39 | 1.3 | 1 | 1.3 | 0.8 | 27 | 0.9 | models, examples, data, more |
| gmat_cot | 92 | 51 | 4.6 | 3.1 | 5.8 | 3.4 | 0 | 1.1 | models, examples, data, more |
| gpqa_cot | 448 | 51 | 2.1 | 1.3 | 2.8 | 1.6 | 0 | 21 | models, examples, data, more |
| gre_physics_cot | 75 | 51 | 5.2 | 3.7 | 6.6 | 4.2 | 0 | 5.3 | models, examples, data, more |
| gsm8k_cot | 1319 | 52 | 1.1 | 0.78 | 1.1 | 0.64 | 0.15 | 0.99 | models, examples, data, more |
| gsm8k_plus_cot | 10552 | 52 | 0.44 | 0.34 | 0.44 | 0.25 | 0.78 | 5.6 | models, examples, data, more |
| human_eval | 164 | 51 | 3.5 | 2.6 | 4.4 | 2.6 | 0.61 | 0 | models, examples, data, more |
| human_eval_plus | 164 | 51 | 3.6 | 2.8 | 4.3 | 2.6 | 6.1 | 1.2 | models, examples, data, more |
| jeebench_chat_cot | 515 | 49 | 1.6 | 1.1 | 2 | 1.2 | 14 | 5.2 | models, examples, data, more |
| leetcode | 180 | 51 | 3 | 2.4 | 3.4 | 2.2 | 29 | 0 | models, examples, data, more |
| lsat_cot | 403 | 45 | 2.3 | 1.8 | 2.7 | 1.8 | 0.25 | 10 | models, examples, data, more |
| math500_cot | 500 | 51 | 2 | 1.5 | 2.1 | 1.1 | 0.6 | 1 | models, examples, data, more |
| math_cot | 5000 | 49 | 0.63 | 0.47 | 0.73 | 0.37 | 1.1 | 0.64 | models, examples, data, more |
| mbpp | 500 | 35 | 2.1 | 1.7 | 2.1 | 1.4 | 3 | 2.4 | models, examples, data, more |
| mgsm_cot | 2750 | 52 | 0.83 | 0.66 | 0.9 | 0.53 | 0.36 | 1.6 | models, examples, data, more |
| mmlu_pro_cot | 12032 | 45 | 0.42 | 0.31 | 0.49 | 0.29 | 0.066 | 13 | models, examples, data, more |
Generated on 2026-02-14 06:32:13