The table below shows the typical standard errors between pairs of models on this dataset as a function of absolute accuracy, alongside each model's pass@1 accuracy, win rate, and sample count.
| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| google_gemma_3_27b_it | 86.5 | 43.3 | 8 | 1.5 | 1.2 | 0.9 |
| deepseek_r1_distill_qwen_32b | 84.8 | 42.1 | 1100 | 1.6 | 1.2 | 1.1 |
| qwen3-32b | 82.3 | 39.9 | 700 | 1.7 | 1.3 | 1.1 |
| qwen3-14b | 82.3 | 39.8 | 1100 | 1.7 | 1.4 | 1 |
| deepseek_r1_distill_llama_70b | 82.1 | 40.1 | 1100 | 1.7 | 1.3 | 1.2 |
| qwen2-math-72b-instruct | 81.1 | 38.7 | 35 | 1.7 | 1.5 | 0.98 |
| google_gemma_3_12b_it | 80.2 | 38.3 | 1100 | 1.8 | 1.5 | 1 |
| qwen3-8b | 79.7 | 37.8 | 1100 | 1.8 | 1.4 | 1.1 |
| qwen3-4b | 78.2 | 36.8 | 1100 | 1.8 | 1.5 | 1.1 |
| deepseek_r1_distill_qwen_7b | 78.1 | 37.2 | 1100 | 1.8 | 1.4 | 1.2 |
| qwen2.5-coder-32b-instruct | 77 | 35.7 | 1100 | 1.9 | 1.5 | 1.1 |
| qwen2.5-coder-14b-instruct | 72.6 | 32.7 | 1100 | 2 | 1.6 | 1.2 |
| deepseek_r1_distill_llama_8b | 70.5 | 32.8 | 71 | 2 | 1.5 | 1.4 |
| deepseek_r1_distill_qwen_1.5b | 68.7 | 30.7 | 1100 | 2.1 | 1.5 | 1.4 |
| qwen3-1.7b | 67.9 | 29.9 | 1100 | 2.1 | 1.6 | 1.3 |
| llama-3.1-70B-instruct | 66.4 | 29.1 | 1000 | 2.1 | 1.7 | 1.3 |
| google_gemma_3_4b_it | 65.7 | 29.4 | 1100 | 2.1 | 1.7 | 1.3 |
| qwen2-72b-instruct | 65.3 | 27.9 | 220 | 2.1 | 1.7 | 1.3 |
| qwen2.5-coder-7b-instruct | 62.4 | 26.5 | 1100 | 2.2 | 1.6 | 1.5 |
| google_gemma_2_27b_it | 53.1 | 20.8 | 1100 | 2.2 | 1.9 | 1.2 |
| qwen2-7b-instruct | 52.5 | 20.5 | 1100 | 2.2 | 1.7 | 1.5 |
| mistralai_ministral_8b_instruct_2410 | 49.3 | 19 | 1100 | 2.2 | 1.6 | 1.5 |
| mistralai_mathstral_7b_v0.1 | 48.5 | 18.6 | 1100 | 2.2 | 1.6 | 1.5 |
| llama-3.1-8B-instruct | 48.4 | 18.7 | 1100 | 2.2 | 1.7 | 1.5 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 47.8 | 18.4 | 100 | 2.2 | 1.6 | 1.5 |
| qwen2.5-coder-3b-instruct | 47.4 | 18.7 | 1100 | 2.2 | 1.5 | 1.7 |
| google_gemma_2_9b_it | 47.4 | 17.6 | 1100 | 2.2 | 1.9 | 1.2 |
| llama-3.2-3B-instruct | 44.5 | 16.7 | 1100 | 2.2 | 1.7 | 1.4 |
| qwen1.5-72b-chat | 40.6 | 14.9 | 390 | 2.2 | 1.6 | 1.5 |
| qwen1.5-32b-chat | 39.9 | 14.6 | 30 | 2.2 | 1.6 | 1.5 |
| qwen3-0.6b | 33.4 | 12.1 | 1100 | 2.1 | 1.4 | 1.5 |
| qwen2.5-coder-1.5b-instruct | 33.1 | 11.7 | 1100 | 2.1 | 1.4 | 1.6 |
| qwen1.5-14b-chat | 30.9 | 10.4 | 1100 | 2.1 | 1.5 | 1.5 |
| llama-3.2-1B-instruct | 26.6 | 8.92 | 1100 | 2 | 1.4 | 1.4 |
| google_codegemma_1.1_7b_it | 21.9 | 7.04 | 1100 | 1.9 | 1.3 | 1.3 |
| deepseek_v2_lite_chat | 21.4 | 6.98 | 1100 | 1.8 | 1.1 | 1.4 |
| qwen1.5-7b-chat | 17 | 5.27 | 1100 | 1.7 | 1.1 | 1.3 |
| google_gemma_3_1b_it | 14.5 | 6.01 | 1100 | 1.6 | 1 | 1.2 |
| mistralai_mistral_7b_instruct_v0.3 | 13.9 | 4.34 | 1100 | 1.5 | 0.97 | 1.2 |
| mistralai_mistral_7b_instruct_v0.2 | 10.4 | 3.22 | 920 | 1.4 | 0.79 | 1.1 |
| qwen2.5-coder-0.5b-instruct | 8.66 | 2.77 | 1100 | 1.3 | 0.68 | 1.1 |
| qwen2-1.5b-instruct | 7.18 | 2.17 | 1100 | 1.2 | 0.51 | 1 |
| mistralai_mistral_7b_instruct_v0.1 | 6.04 | 1.88 | 1100 | 1.1 | 0.48 | 0.95 |
| google_gemma_7b_it | 5.78 | 1.72 | 1100 | 1 | 0.68 | 0.79 |
| qwen2-0.5b-instruct | 3.34 | 1.13 | 1100 | 0.8 | 0.28 | 0.75 |
| qwen1.5-1.8b-chat | 1.53 | 0.539 | 1100 | 0.55 | 0.14 | 0.53 |
| qwen1.5-0.5b-chat | 0.92 | 0.386 | 890 | 0.43 | 0.089 | 0.42 |
| google_gemma_2b_it | 0.196 | 0.0683 | 1100 | 0.2 | 0.041 | 0.19 |
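
For intuition on how these numbers scale with absolute accuracy, here is a minimal sketch in Python. It assumes a plain binomial approximation, SE = 100·sqrt(p(1−p)/n) percentage points, for a single model's accuracy, plus a simple paired estimator for the accuracy difference between two models scored on the same questions; it is not the code behind the table, and the exact decomposition into SE(A), SE_x(A), and SE_pred(A) used there may differ. The function names (`binomial_se`, `paired_se`) and the toy numbers are illustrative assumptions.

```python
import numpy as np

def binomial_se(acc_pct, n):
    """Binomial standard error, in percentage points, of an accuracy
    estimate `acc_pct` (in percent) measured on `n` independent questions."""
    p = acc_pct / 100.0
    return 100.0 * np.sqrt(p * (1.0 - p) / n)

def paired_se(scores_a, scores_b):
    """Standard error, in percentage points, of the accuracy difference
    between two models scored 0/1 on the same set of questions."""
    d = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    return 100.0 * d.std(ddof=1) / np.sqrt(len(d))

# At ~80% accuracy on ~1100 questions, the single-model binomial SE is about
# 1.2 points; an unpaired comparison of two such models is roughly sqrt(2)
# times larger (~1.7 points), the same ballpark as the SE(A) column above.
print(binomial_se(80.0, 1100))               # ~1.21
print(np.sqrt(2) * binomial_se(80.0, 1100))  # ~1.71

# Toy paired comparison: two models whose per-question outcomes are highly
# correlated, so the paired SE comes out well below the unpaired estimate.
rng = np.random.default_rng(0)
shared = rng.random(1100) < 0.75          # questions both models tend to solve
a = shared | (rng.random(1100) < 0.10)    # model A solves a few extras
b = shared | (rng.random(1100) < 0.05)    # model B solves fewer extras
print(paired_se(a, b))                    # well under the ~1.7 unpaired figure
```

The point of the paired estimator is that, on a fixed question set, much of the question-to-question variability is shared between models, so a paired comparison can be substantially tighter than the unpaired sqrt(2) heuristic suggests.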