Raw data: summary.csv
| benchmark_id | size | models | SE(A) | SE_x(A) | SE(A-B) | SE_x(A-B) | corr(A,B) | no_solve | tau- | details |
|---|---|---|---|---|---|---|---|---|---|---|
| aime2024_cot | 30 | 50 | 7.6 | 5.7 | 8 | 3.7 | 77 | 3.3 | 0 | models, examples, data, raw |
| aime2025_cot | 30 | 50 | 7 | 5.6 | 6.8 | 3.3 | 82 | 3.3 | 13 | models, examples, data, raw |
| human_eval_plus | 164 | 49 | 3.6 | 2.7 | 4.3 | 2.4 | 61 | 6.1 | 1.2 | models, examples, data, raw |
| math500_cot | 500 | 48 | 2 | 1.5 | 2.1 | 0.94 | 79 | 0.4 | 1.2 | models, examples, data, raw |