Full table: summary.csv
| benchmark_id | size | models | SE(A) | SE(A-B) | no_solve | tau- |
|---|---|---|---|---|---|---|
| CRUXEval-input-T0.2 | 800 | 37 | 1.7 | 1.8 | 2.4 | 10 |
| CRUXEval-input-T0.8 | 800 | 31 | 1.6 | 1.8 | 2.1 | 10 |
| CRUXEval-output-T0.2 | 800 | 37 | 1.7 | 1.6 | 3 | 3.8 |
| CRUXEval-output-T0.8 | 800 | 31 | 1.7 | 1.6 | 4.4 | 6.9 |
| DS1000 | 1000 | 105 | 1.4 | 1.5 | 11 | 1.2 |
| agi_english | 2546 | 35 | 0.93 | 1.1 | 0.98 | 21 |
| arc_challenge | 1165 | 38 | 1.4 | 1.3 | 14 | 11 |
| gsm8k | 1319 | 37 | 1.2 | 1.3 | 2.1 | 0.53 |
| hellaswag | 10042 | 36 | 0.41 | 0.26 | 6.1 | 1.6 |
| humaneval | 164 | 78 | 3.5 | 3.8 | 3.7 | 1.2 |
| humaneval+ | 164 | 49 | 3.7 | 3.9 | 4.3 | 1.8 |
| lcb_codegen_v5 | 880 | 24 | 1.5 | 1.3 | 6.4 | 1.1 |
| lcb_codegen_v6 | 1055 | 33 | 1.3 | 1.1 | 3.4 | 1.2 |
| lcb_codegen_v6_080124 | 454 | 33 | 2.2 | 1.9 | 5.3 | 2 |
| mbpp | 378 | 59 | 2.4 | 2.4 | 2.4 | 4 |
| mbpp+ | 378 | 59 | 2.5 | 2.4 | 9.5 | 5.8 |
| mmlu | 14042 | 36 | 0.39 | 0.44 | 0.6 | 12 |
| nq | 3610 | 36 | 0.72 | 0.64 | 32 | 5.5 |
| piqa | 1838 | 36 | 0.93 | 0.7 | 5.3 | 8.4 |
| safim | 17410 | 22 | 0.36 | 0.36 | 16 | 7.9 |
| siqa | 1954 | 36 | 1.1 | 0.86 | 15 | 19 |
| swebench-bash-only | 500 | 28 | 2.1 | 2 | 13 | 2 |
| swebench-lite | 300 | 84 | 2.7 | 2.6 | 12 | 1.7 |
| swebench-multimodal | 510 | 12 | 2.1 | 1.4 | 51 | 5.3 |
| swebench-test | 2294 | 23 | 0.87 | 0.77 | 43 | 1 |
| swebench-verified | 500 | 131 | 2.1 | 1.9 | 6.8 | 1.6 |
| terminal-bench-1.0 | 80 | 25 | 5.4 | 5.3 | 18 | 5 |
| terminal-bench-2.0 | 89 | 7 | 5.1 | 5.8 | 9 | 12 |
| tqa | 11313 | 36 | 0.43 | 0.35 | 12 | 2.9 |
Generated on 2026-02-14 06:32:53
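Since the full table ships as `summary.csv`, it can be loaded programmatically. A minimal sketch using only the standard library, assuming the CSV's column names match the table headers above (the `SAMPLE` string here just inlines three rows from the table so the snippet runs standalone; swap it for `open("summary.csv")` to read the real file):

```python
import csv
import io

# Three rows copied from the table above, standing in for summary.csv.
SAMPLE = """benchmark_id,size,models,SE(A),SE(A-B),no_solve,tau-
humaneval,164,78,3.5,3.8,3.7,1.2
mmlu,14042,36,0.39,0.44,0.6,12
gsm8k,1319,37,1.2,1.3,2.1,0.53
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Rank benchmarks by the standard error of accuracy, largest first --
# smaller benchmarks (e.g. humaneval, n=164) tend to have larger SE(A).
rows.sort(key=lambda r: float(r["SE(A)"]), reverse=True)

for r in rows:
    print(f'{r["benchmark_id"]}: SE(A)={r["SE(A)"]}, size={r["size"]}')
```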