Raw data: summary.csv
| benchmark_id | size | models | SE(A) | SE_x(A) | SE(A-B) | SE_x(A-B) | corr(A,B) | no_solve | tau- | details | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| aime2024_cot | 30 | 52 | 7.3 | 5.7 | 8.1 | 4.7 | 61 | 20 | 0 | models | examples | data | raw |
| aime2025_cot | 30 | 52 | 7.2 | 5.9 | 6.7 | 3.4 | 80 | 27 | 3.3 | models | examples | data | raw |
| ap_cot | 711 | 52 | 1.6 | 1.2 | 1.9 | 1.1 | 51 | 0 | 3.1 | models | examples | data | raw |
| bbh_cot | 6511 | 45 | 0.56 | 0.41 | 0.68 | 0.4 | 48 | 1.2 | 2.1 | models | examples | data | raw |
| cruxeval_input_cot | 800 | 52 | 1.6 | 1.2 | 1.9 | 1 | 60 | 0.62 | 1.5 | models | examples | data | raw |
| cruxeval_output_cot | 800 | 52 | 1.6 | 1.2 | 1.8 | 0.93 | 65 | 2.5 | 1.1 | models | examples | data | raw |
| ds1000 | 1000 | 39 | 1.3 | 1 | 1.3 | 0.8 | 65 | 27 | 0.9 | models | examples | data | raw |
| gmat_cot | 92 | 51 | 4.6 | 3.1 | 5.8 | 3.4 | 43 | 0 | 1.1 | models | examples | data | raw |
| gpqa_cot | 448 | 51 | 2.1 | 1.3 | 2.8 | 1.6 | 31 | 0 | 21 | models | examples | data | raw |
| gre_physics_cot | 75 | 51 | 5.2 | 3.7 | 6.6 | 4.2 | 32 | 0 | 5.3 | models | examples | data | raw |
| gsm8k_cot | 1319 | 52 | 1.1 | 0.78 | 1.1 | 0.64 | 61 | 0.15 | 0.99 | models | examples | data | raw |
| gsm8k_plus_cot | 10552 | 52 | 0.44 | 0.34 | 0.44 | 0.25 | 71 | 0.78 | 5.6 | models | examples | data | raw |
| human_eval | 164 | 51 | 3.5 | 2.6 | 4.4 | 2.6 | 49 | 0.61 | 0 | models | examples | data | raw |
| human_eval_plus | 164 | 51 | 3.6 | 2.8 | 4.3 | 2.6 | 53 | 6.1 | 1.2 | models | examples | data | raw |
| jeebench_chat_cot | 515 | 49 | 1.6 | 1.1 | 2 | 1.2 | 40 | 14 | 5.2 | models | examples | data | raw |
| leetcode | 180 | 51 | 3 | 2.4 | 3.4 | 2.2 | 54 | 29 | 0 | models | examples | data | raw |
| lsat_cot | 403 | 45 | 2.3 | 1.8 | 2.7 | 1.8 | 48 | 0.25 | 10 | models | examples | data | raw |
| math500_cot | 500 | 51 | 2 | 1.5 | 2.1 | 1.1 | 69 | 0.6 | 1 | models | examples | data | raw |
| math_cot | 5000 | 49 | 0.63 | 0.47 | 0.73 | 0.37 | 66 | 1.1 | 0.64 | models | examples | data | raw |
| mbpp | 500 | 35 | 2.1 | 1.7 | 2.1 | 1.4 | 69 | 3 | 2.4 | models | examples | data | raw |
| mgsm_cot | 2750 | 52 | 0.83 | 0.66 | 0.9 | 0.53 | 64 | 0.36 | 1.6 | models | examples | data | raw |
| mmlu_pro_cot | 12032 | 45 | 0.42 | 0.31 | 0.49 | 0.29 | 54 | 0.066 | 13 | models | examples | data | raw |
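
As a rough sanity check on the table, the sketch below tests whether SE(A) shrinks like 1/sqrt(size). It rests on two assumptions that the table does not state: that the SE columns are in percentage points, and that SE(A) behaves roughly like a per-example standard deviation divided by sqrt(size). Under those assumptions, a pass/fail score has a per-example standard deviation of at most 50 points, so SE(A) * sqrt(size) should stay below 50. The `ROWS` list and the helper names are illustrative only; the values are copied from a few rows above.

```python
import math

# (benchmark_id, size, SE(A)) copied from a few rows of the table above.
# Assumption: SE(A) is expressed in percentage points.
ROWS = [
    ("aime2024_cot", 30, 7.3),
    ("human_eval", 164, 3.5),
    ("gpqa_cot", 448, 2.1),
    ("gsm8k_cot", 1319, 1.1),
    ("math_cot", 5000, 0.63),
    ("mmlu_pro_cot", 12032, 0.42),
]


def implied_per_example_sd(size, se):
    """Per-example standard deviation implied by se = sd / sqrt(size)."""
    return se * math.sqrt(size)


def implied_sds(rows=ROWS):
    """Map each benchmark_id to its implied per-example standard deviation."""
    return {name: implied_per_example_sd(size, se) for name, size, se in rows}


if __name__ == "__main__":
    for name, sd in implied_sds().items():
        print(f"{name:>14}: implied per-example sd = {sd:.1f}")
```

If the implied values cluster in a narrow band across benchmarks whose sizes differ by orders of magnitude, the differences in SE(A) between rows are mostly explained by benchmark size rather than by anything specific to the benchmark.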