Raw data: summary.csv

| benchmark_id | size | models | SE(A) | SE_x(A) | SE(A-B) | SE_x(A-B) | corr(A,B) | no_solve | tau- | details (models) | details (examples) | details (data) | details (raw) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| aime2024_cot | 30 | 51 | 7.4 | 5.5 | 7.6 | 3 | 65 | 37 | 0 | models | examples | data | raw |
| aime2025_cot | 30 | 52 | 7 | 6 | 6.7 | 4.2 | 63 | 50 | 0 | models | examples | data | raw |
| ap_cot | 711 | 52 | 1.6 | 1.1 | 1.9 | 1.1 | 43 | 0 | 3.2 | models | examples | data | raw |
| bbh_cot | 6511 | 45 | 0.55 | 0.37 | 0.68 | 0.36 | 39 | 2 | 1.6 | models | examples | data | raw |
| cruxeval_input_cot | 800 | 52 | 1.6 | 1.1 | 1.9 | 0.85 | 49 | 2.4 | 1.4 | models | examples | data | raw |
| cruxeval_output_cot | 800 | 52 | 1.6 | 1.1 | 1.8 | 0.83 | 54 | 4.4 | 0.5 | models | examples | data | raw |
| ds1000 | 1000 | 39 | 1.2 | 0.93 | 1.3 | 0.66 | 58 | 32 | 0.3 | models | examples | data | raw |
| gmat_cot | 92 | 51 | 4.6 | 2.9 | 5.9 | 3.2 | 33 | 0 | 1.1 | models | examples | data | raw |
| gpqa_cot | 448 | 51 | 2.1 | 1.2 | 2.7 | 1.4 | 23 | 0 | 21 | models | examples | data | raw |
| gre_physics_cot | 75 | 51 | 5.2 | 3.4 | 6.7 | 3.9 | 26 | 0 | 4 | models | examples | data | raw |
| gsm8k_cot | 1319 | 52 | 1.1 | 0.76 | 1.2 | 0.61 | 49 | 0.076 | 0.91 | models | examples | data | raw |
| gsm8k_plus_cot | 10552 | 52 | 0.44 | 0.33 | 0.48 | 0.24 | 58 | 1.5 | 6.2 | models | examples | data | raw |
| human_eval | 164 | 51 | 3.5 | 2.4 | 4.4 | 2.2 | 41 | 1.2 | 0 | models | examples | data | raw |
| human_eval_plus | 164 | 51 | 3.6 | 2.5 | 4.3 | 2.3 | 44 | 6.7 | 0.61 | models | examples | data | raw |
| jeebench_chat_cot | 515 | 49 | 1.7 | 1.1 | 2.1 | 0.99 | 36 | 19 | 6 | models | examples | data | raw |
| leetcode | 180 | 51 | 3.1 | 2.3 | 3.5 | 1.8 | 47 | 39 | 0 | models | examples | data | raw |
| lsat_cot | 403 | 42 | 2.3 | 1.7 | 2.8 | 1.7 | 41 | 0.25 | 11 | models | examples | data | raw |
| math500_cot | 500 | 51 | 2 | 1.5 | 2.2 | 1 | 58 | 1.8 | 0.8 | models | examples | data | raw |
| math_cot | 5000 | 48 | 0.63 | 0.44 | 0.73 | 0.34 | 50 | 1.9 | 0.72 | models | examples | data | raw |
| mbpp | 500 | 35 | 2.1 | 1.7 | 2.2 | 1.2 | 64 | 3.8 | 2.2 | models | examples | data | raw |
| mgsm_cot | 2750 | 51 | 0.83 | 0.65 | 0.93 | 0.57 | 54 | 0.8 | 1.3 | models | examples | data | raw |
| mmlu_pro_cot | 12032 | 45 | 0.42 | 0.3 | 0.49 | 0.26 | 44 | 0.43 | 13 | models | examples | data | raw |
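
As a rough sanity check on the `SE(A)` column, the sketch below compares a few rows against the largest standard error that a mean of independent 0/1 scores can have, which is `0.5 / sqrt(size)`. It assumes that `SE(A)` is the standard error of a model's mean accuracy in percentage points and that `size` is the number of examples; the table itself does not state either. The helper name `max_binomial_se_pct` and the `ROWS` list are illustrative only, with values copied from the table.

```python
import math

# (benchmark_id, size, SE(A)) copied from a few rows of the table above.
# Assumption: SE(A) is reported in percentage points.
ROWS = [
    ("aime2024_cot", 30, 7.4),
    ("gmat_cot", 92, 4.6),
    ("gsm8k_cot", 1319, 1.1),
    ("mmlu_pro_cot", 12032, 0.42),
]


def max_binomial_se_pct(size: int) -> float:
    """Upper bound on the SE of a mean of `size` independent 0/1 scores.

    The per-example variance p * (1 - p) peaks at p = 0.5, so the SE of the
    mean is at most 0.5 / sqrt(size). The result is scaled to percentage points.
    """
    return 100.0 * 0.5 / math.sqrt(size)


def check_rows(rows):
    """Return (benchmark_id, reported SE, bound, reported <= bound) per row."""
    results = []
    for name, size, reported_se in rows:
        bound = max_binomial_se_pct(size)
        results.append((name, reported_se, bound, reported_se <= bound))
    return results


if __name__ == "__main__":
    for name, reported_se, bound, ok in check_rows(ROWS):
        print(f"{name:>14}  SE(A)={reported_se:<5}  bound={bound:.2f}  within bound: {ok}")
```

Under those assumptions, the bound also makes the size effect visible: a 30-example benchmark cannot have a per-model standard error below a few points unless accuracy is near 0 or 100 percent, while the largest benchmarks stay under half a point.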