Raw data: summary.csv

| benchmark_id | size | models | SE(A) | SE(A-B) | corr(A,B) | no_solve | tau- | details |
|---|---|---|---|---|---|---|---|---|
| CRUXEval-input-T0.2 | 800 | 37 | 1.7 | 1.8 | 56 | 2.4 | 10 | models, examples, data, raw |
| CRUXEval-input-T0.8 | 800 | 31 | 1.6 | 1.8 | 69 | 2.1 | 10 | models, examples, data, raw |
| CRUXEval-output-T0.2 | 800 | 37 | 1.7 | 1.6 | 64 | 3 | 3.8 | models, examples, data, raw |
| CRUXEval-output-T0.8 | 800 | 31 | 1.7 | 1.6 | 75 | 4.4 | 6.9 | models, examples, data, raw |
| DS1000 | 1000 | 105 | 1.4 | 1.5 | 46 | 11 | 1.2 | models, examples, data, raw |
| agi_english | 2546 | 35 | 0.93 | 1.1 | 25 | 0.98 | 21 | models, examples, data, raw |
| arc_challenge | 1165 | 38 | 1.4 | 1.3 | 62 | 14 | 11 | models, examples, data, raw |
| gsm8k | 1319 | 37 | 1.2 | 1.3 | 38 | 2.1 | 0.53 | models, examples, data, raw |
| hellaswag | 10042 | 36 | 0.41 | 0.26 | 77 | 6.1 | 1.6 | models, examples, data, raw |
| humaneval | 164 | 78 | 3.5 | 3.8 | 46 | 3.7 | 1.2 | models, examples, data, raw |
| humaneval+ | 164 | 49 | 3.7 | 3.9 | 45 | 4.3 | 1.8 | models, examples, data, raw |
| lcb_codegen_v5 | 880 | 24 | 1.5 | 1.3 | 66 | 6.4 | 1.1 | models, examples, data, raw |
| lcb_codegen_v6 | 1055 | 33 | 1.3 | 1.1 | 65 | 3.4 | 1.2 | models, examples, data, raw |
| lcb_codegen_v6_080124 | 454 | 33 | 2.2 | 1.9 | 65 | 5.3 | 2 | models, examples, data, raw |
| mbpp | 378 | 59 | 2.4 | 2.4 | 47 | 2.4 | 4 | models, examples, data, raw |
| mbpp+ | 378 | 59 | 2.5 | 2.4 | 54 | 9.5 | 5.8 | models, examples, data, raw |
| mmlu | 14042 | 36 | 0.39 | 0.44 | 38 | 0.6 | 12 | models, examples, data, raw |
| nq | 3610 | 36 | 0.72 | 0.64 | 60 | 32 | 5.5 | models, examples, data, raw |
| piqa | 1838 | 36 | 0.93 | 0.7 | 71 | 5.3 | 8.4 | models, examples, data, raw |
| safim | 17410 | 22 | 0.36 | 0.36 | 50 | 16 | 7.9 | models, examples, data, raw |
| siqa | 1954 | 36 | 1.1 | 0.86 | 71 | 15 | 19 | models, examples, data, raw |
| swebench-lite | 300 | 84 | 2.7 | 2.6 | 52 | 12 | 1.7 | models, examples, data, raw |
| swebench-test | 2294 | 23 | 0.87 | 0.77 | 55 | 43 | 1 | models, examples, data, raw |
| swebench-verified | 500 | 131 | 2.1 | 1.9 | 58 | 6.8 | 1.6 | models, examples, data, raw |
| tqa | 11313 | 36 | 0.43 | 0.35 | 67 | 12 | 2.9 | models, examples, data, raw |
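
One way to read the SE(A-B) column is as a rough threshold for when a gap between two models on a benchmark is likely to be more than noise. The sketch below is illustrative only. It assumes that SE(A-B) is the standard error of the accuracy difference between two models, that it is expressed in percentage points, and that a normal approximation is adequate; none of these is stated in the table itself. The helper names `min_detectable_gap` and `z_score` are hypothetical, and the three values are copied from the humaneval, gsm8k, and mmlu rows.

```python
# Illustrative sketch: SE(A-B) values copied from three rows of the table above.
# Assumption: they are standard errors of an accuracy difference, in percentage points.
SE_DIFF = {
    "humaneval": 3.8,
    "gsm8k": 1.3,
    "mmlu": 0.44,
}

Z_95 = 1.96  # two-sided 95% critical value of the standard normal distribution


def min_detectable_gap(se_diff: float, z: float = Z_95) -> float:
    """Smallest gap (same units as se_diff) that exceeds z standard errors."""
    return z * se_diff


def z_score(gap: float, se_diff: float) -> float:
    """Observed gap between two models, expressed in standard errors."""
    return gap / se_diff


if __name__ == "__main__":
    for name, se in SE_DIFF.items():
        gap = min_detectable_gap(se)
        print(f"{name}: a gap should exceed roughly {gap:.2f} points")
```

Under these assumptions, a 2-point gap would be well within noise on humaneval (164 examples) but several standard errors wide on mmlu (14042 examples), which is the practical reason the size column matters when comparing models.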