ap_cot: results by model



SE predicted by accuracy

Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.
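If each accuracy is estimated over n questions, the binomial formula sqrt(p(1 - p) / n) gives the expected shape of this curve: the standard error peaks at 50% accuracy and shrinks toward the extremes. A minimal sketch (the sample size `n` below is illustrative, not this dataset's actual size):

```python
import math

def accuracy_se(p: float, n: int) -> float:
    """Binomial standard error of an accuracy estimate, in percentage points.

    p: accuracy as a fraction in [0, 1]; n: number of questions.
    """
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# SE is largest at p = 0.5 and falls off toward 0 and 1.
for p in (0.2, 0.5, 0.9):
    print(f"p={p:.1f}  SE={accuracy_se(p, 700):.2f} pp")
```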

CDF of question level accuracy
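A curve like this can be built from per-question accuracies (for each question, the fraction of attempts answered correctly). A minimal sketch with hypothetical data, not values from this dataset:

```python
# Hypothetical per-question accuracies in [0, 1].
accs = [0.1, 0.35, 0.5, 0.5, 0.8, 0.95, 1.0]

def ecdf(values, x):
    """Empirical CDF: fraction of values <= x."""
    return sum(v <= x for v in values) / len(values)

# Evaluating the ECDF on a grid traces out the curve.
for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"P(acc <= {x:.2f}) = {ecdf(accs, x):.2f}")
```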

Results table by model

| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| qwen3-32b | 92.5 | 31.9 | 11 | 0.99 | 0.81 | 0.56 |
| qwen3-14b | 90.3 | 30.3 | 11 | 1.1 | 0.96 | 0.56 |
| qwen2-72b-instruct | 89 | 29.6 | 9 | 1.2 | 0.96 | 0.68 |
| llama-3.1-70B-instruct | 88.7 | 30 | 11 | 1.2 | 1.2 | 0 |
| google_gemma_3_27b_it | 88.6 | 29.4 | 10 | 1.2 | 1 | 0.62 |
| deepseek_r1_distill_llama_70b | 88.6 | 29.8 | 9 | 1.2 | 0.95 | 0.72 |
| deepseek_r1_distill_qwen_32b | 87.2 | 28.9 | 9 | 1.3 | 0.96 | 0.8 |
| qwen3-8b | 86.8 | 28.3 | 11 | 1.3 | 1 | 0.72 |
| qwen2.5-coder-32b-instruct | 86.4 | 27.9 | 9 | 1.3 | 1 | 0.77 |
| deepseek_r1_distill_qwen_14b | 85.3 | 27.6 | 12 | 1.3 | 1 | 0.83 |
| google_gemma_3_12b_it | 85.2 | 27.2 | 11 | 1.3 | 1.1 | 0.7 |
| qwen3-4b | 83.3 | 26.7 | 11 | 1.4 | 1.2 | 0.72 |
| google_gemma_2_27b_it | 83.1 | 26.1 | 10 | 1.4 | 1.2 | 0.76 |
| qwen1.5-72b-chat | 81.5 | 25.4 | 9 | 1.5 | 1.1 | 0.91 |
| google_gemma_2_9b_it | 80.9 | 25 | 12 | 1.5 | 1.2 | 0.81 |
| qwen2-math-72b-instruct | 80.5 | 25.3 | 9 | 1.5 | 1.1 | 0.99 |
| qwen2.5-coder-14b-instruct | 80.5 | 24.8 | 11 | 1.5 | 1.2 | 0.94 |
| qwen1.5-32b-chat | 79.1 | 24.3 | 11 | 1.5 | 1.2 | 0.97 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 76.8 | 24.9 | 11 | 1.6 | 1.1 | 1.2 |
| llama-3.1-8B-instruct | 73.4 | 21.8 | 15 | 1.7 | 1.7 | 0 |
| qwen2-7b-instruct | 72.6 | 21.4 | 12 | 1.7 | 1.2 | 1.1 |
| qwen1.5-14b-chat | 71.8 | 20.7 | 11 | 1.7 | 1.4 | 0.98 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 71 | 20.9 | 11 | 1.7 | 1.3 | 1.1 |
| qwen2.5-coder-7b-instruct | 70.9 | 20.7 | 12 | 1.7 | 1.2 | 1.2 |
| qwen3-1.7b | 70.8 | 20.9 | 11 | 1.7 | 1.4 | 0.99 |
| mistralai_ministral_8b_instruct_2410 | 69.2 | 20.1 | 12 | 1.7 | 1.2 | 1.3 |
| google_gemma_3_4b_it | 68.5 | 19.7 | 13 | 1.7 | 1.5 | 0.96 |
| deepseek_r1_distill_llama_8b | 63.2 | 18.2 | 12 | 1.8 | 1.2 | 1.4 |
| qwen2-math-7b-instruct | 61.9 | 18.6 | 11 | 1.8 | 1.2 | 1.3 |
| qwen1.5-7b-chat | 61.5 | 17.1 | 12 | 1.8 | 1.4 | 1.2 |
| llama-3.2-3B-instruct | 60.3 | 16.4 | 18 | 1.8 | 1.8 | 0 |
| mistralai_mistral_7b_instruct_v0.3 | 59.7 | 16.5 | 12 | 1.8 | 1.4 | 1.2 |
| mistralai_mathstral_7b_v0.1 | 57.6 | 15.9 | 12 | 1.9 | 1.2 | 1.4 |
| deepseek_v2_lite_chat | 57.2 | 15.8 | 11 | 1.9 | 1.3 | 1.3 |
| qwen2.5-coder-3b-instruct | 55.3 | 15.2 | 11 | 1.9 | 1.2 | 1.4 |
| deepseek_r1_distill_qwen_7b | 54.4 | 15.5 | 12 | 1.9 | 1.2 | 1.4 |
| google_codegemma_1.1_7b_it | 52.9 | 14.2 | 13 | 1.9 | 1.4 | 1.2 |
| mistralai_mistral_7b_instruct_v0.2 | 52.3 | 14.2 | 12 | 1.9 | 1.4 | 1.3 |
| mistralai_mistral_7b_instruct_v0.1 | 49 | 13.1 | 12 | 1.9 | 1.2 | 1.4 |
| google_gemma_7b_it | 48.1 | 13.1 | 12 | 1.9 | 1.4 | 1.2 |
| qwen3-0.6b | 47.2 | 12.9 | 12 | 1.9 | 1.4 | 1.3 |
| qwen2-1.5b-instruct | 39.2 | 10.4 | 11 | 1.8 | 1.2 | 1.4 |
| google_gemma_3_1b_it | 35.4 | 9.39 | 12 | 1.8 | 1.3 | 1.2 |
| qwen2-math-1.5b-instruct | 34.3 | 10.7 | 11 | 1.8 | 0.95 | 1.5 |
| qwen2.5-coder-1.5b-instruct | 33.9 | 9.75 | 11 | 1.8 | 0.81 | 1.6 |
| google_gemma_2b_it | 32.6 | 9.28 | 12 | 1.8 | 1.5 | 0.88 |
| llama-3.2-1B-instruct | 31.9 | 8.67 | 21 | 1.7 | 1.7 | 0 |
| qwen1.5-1.8b-chat | 30.5 | 8.29 | 12 | 1.7 | 0.98 | 1.4 |
| deepseek_r1_distill_qwen_1.5b | 26.5 | 7.34 | 12 | 1.7 | 0.81 | 1.4 |
| qwen2-0.5b-instruct | 22.3 | 7.08 | 13 | 1.6 | 0.79 | 1.3 |
| qwen2.5-coder-0.5b-instruct | 19.7 | 6.82 | 12 | 1.5 | 0.39 | 1.4 |
| qwen1.5-0.5b-chat | 19.7 | 6.23 | 12 | 1.5 | 0.68 | 1.3 |
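The three SE columns appear to combine in quadrature, SE(A)² ≈ SE_x(A)² + SE_pred(A)², up to rounding (an observation about the numbers in this table, not a statement of the paper's definitions). A quick check on a few rows:

```python
import math

# (model, SE(A), SE_x(A), SE_pred(A)) copied from rows of the table above.
rows = [
    ("qwen3-32b", 0.99, 0.81, 0.56),
    ("mistralai_mixtral_8x22b_instruct_v0.1", 1.6, 1.1, 1.2),
    ("llama-3.1-8B-instruct", 1.7, 1.7, 0.0),
]

for name, se, se_x, se_pred in rows:
    recombined = math.sqrt(se_x**2 + se_pred**2)
    # recombined should match SE(A) to within the table's rounding.
    print(f"{name}: SE(A)={se}, sqrt(SE_x^2 + SE_pred^2)={recombined:.2f}")
```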