ap_cot: by models

SE predicted by accuracy

Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.
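
For readers who want to reproduce a pairwise standard error, one standard approach is the paired estimate: take the per-question score difference between two models and divide its sample standard deviation by the square root of the number of questions. The sketch below assumes that definition. The page does not state the exact formula it uses, and the function name `paired_se` and the toy scores are hypothetical.

```python
import math

def paired_se(scores_a, scores_b):
    """Standard error of mean(scores_a) - mean(scores_b) for paired per-question scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same questions")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two questions")
    # Per-question differences; pairing removes variance shared by both models.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = sum(diffs) / n
    # Unbiased sample variance of the differences.
    var = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    return math.sqrt(var / n)

# Toy per-question correctness (hypothetical data, not from this page).
toy_a = [1, 1, 0, 1, 1, 0, 1, 1]
toy_b = [1, 0, 0, 1, 0, 0, 1, 1]
toy_se = paired_se(toy_a, toy_b)
```

Because the two models answer the same questions, this estimate is usually smaller than the one obtained by treating the two accuracies as independent.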

CDF of question level accuracy

Results table by model

model pass1 pass@count win_rate count SE(A) SE_x(A) SE_pred(A)
qwen3-32b 92.5 97.7 31.9 11 0.99 0.81 0.56
qwen3-14b 90.3 95.6 30.3 11 1.1 0.96 0.56
qwen2-72b-instruct 89 95.6 29.6 9 1.2 0.96 0.68
llama-3.1-70B-instruct 88.7 88.7 30 11 1.2 1.2 0
google_gemma_3_27b_it 88.6 94.1 29.4 10 1.2 1 0.62
deepseek_r1_distill_llama_70b 88.6 95.9 29.8 9 1.2 0.95 0.72
deepseek_r1_distill_qwen_32b 87.2 96.5 28.9 9 1.3 0.96 0.8
qwen3-8b 86.8 95.8 28.3 11 1.3 1 0.72
qwen2.5-coder-32b-instruct 86.4 95.5 27.9 9 1.3 1 0.77
deepseek_r1_distill_qwen_14b 85.3 95.5 27.6 12 1.3 1 0.83
google_gemma_3_12b_it 85.2 93.1 27.2 11 1.3 1.1 0.7
qwen3-4b 83.3 92.3 26.7 11 1.4 1.2 0.72
google_gemma_2_27b_it 83.1 93 26.1 10 1.4 1.2 0.76
qwen1.5-72b-chat 81.5 93.8 25.4 9 1.5 1.1 0.91
google_gemma_2_9b_it 80.9 92.4 25 12 1.5 1.2 0.81
qwen2-math-72b-instruct 80.5 94.8 25.3 9 1.5 1.1 0.99
qwen2.5-coder-14b-instruct 80.5 94.1 24.8 11 1.5 1.2 0.94
qwen1.5-32b-chat 79.1 94.1 24.3 11 1.5 1.2 0.97
mistralai_mixtral_8x22b_instruct_v0.1 76.8 94.4 24.9 11 1.6 1.1 1.2
llama-3.1-8B-instruct 73.4 73.4 21.8 15 1.7 1.7 0
qwen2-7b-instruct 72.6 93.7 21.4 12 1.7 1.2 1.1
qwen1.5-14b-chat 71.8 89 20.7 11 1.7 1.4 0.98
mistralai_mixtral_8x7b_instruct_v0.1 71 92.4 20.9 11 1.7 1.3 1.1
qwen2.5-coder-7b-instruct 70.9 94.7 20.7 12 1.7 1.2 1.2
qwen3-1.7b 70.8 87.6 20.9 11 1.7 1.4 0.99
mistralai_ministral_8b_instruct_2410 69.2 95.8 20.1 12 1.7 1.2 1.3
google_gemma_3_4b_it 68.5 86.5 19.7 13 1.7 1.5 0.96
deepseek_r1_distill_llama_8b 63.2 92.5 18.2 12 1.8 1.2 1.4
qwen2-math-7b-instruct 61.9 92.7 18.6 11 1.8 1.2 1.3
qwen1.5-7b-chat 61.5 90 17.1 12 1.8 1.4 1.2
llama-3.2-3B-instruct 60.3 60.3 16.4 18 1.8 1.8 0
mistralai_mistral_7b_instruct_v0.3 59.7 86.9 16.5 12 1.8 1.4 1.2
mistralai_mathstral_7b_v0.1 57.6 92.3 15.9 12 1.9 1.2 1.4
deepseek_v2_lite_chat 57.2 90.3 15.8 11 1.9 1.3 1.3
qwen2.5-coder-3b-instruct 55.3 93 15.2 11 1.9 1.2 1.4
deepseek_r1_distill_qwen_7b 54.4 88.6 15.5 12 1.9 1.2 1.4
google_codegemma_1.1_7b_it 52.9 83.4 14.2 13 1.9 1.4 1.2
mistralai_mistral_7b_instruct_v0.2 52.3 80.7 14.2 12 1.9 1.4 1.3
mistralai_mistral_7b_instruct_v0.1 49 88.9 13.1 12 1.9 1.2 1.4
google_gemma_7b_it 48.1 77.4 13.1 12 1.9 1.4 1.2
qwen3-0.6b 47.2 82.3 12.9 12 1.9 1.4 1.3
qwen2-1.5b-instruct 39.2 84.1 10.4 11 1.8 1.2 1.4
google_gemma_3_1b_it 35.4 71.3 9.39 12 1.8 1.3 1.2
qwen2-math-1.5b-instruct 34.3 85.7 10.7 11 1.8 0.95 1.5
qwen2.5-coder-1.5b-instruct 33.9 88.6 9.75 11 1.8 0.81 1.6
google_gemma_2b_it 32.6 50.9 9.28 12 1.8 1.5 0.88
llama-3.2-1B-instruct 31.9 31.9 8.67 21 1.7 1.7 0
qwen1.5-1.8b-chat 30.5 79.5 8.29 12 1.7 0.98 1.4
deepseek_r1_distill_qwen_1.5b 26.5 77.6 7.34 12 1.7 0.81 1.4
qwen2-0.5b-instruct 22.3 78.1 7.08 13 1.6 0.79 1.3
qwen2.5-coder-0.5b-instruct 19.7 85.9 6.82 12 1.5 0.39 1.4
qwen1.5-0.5b-chat 19.7 73.4 6.23 12 1.5 0.68 1.3
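
The three SE columns appear to satisfy SE(A)² ≈ SE_x(A)² + SE_pred(A)², which would be consistent with splitting the total variance into a between-question part and a within-question (resampling) part. That reading is an assumption based on the numbers, not something the page states. The sketch below recombines the two components for a few rows copied from the table and compares them with the reported SE(A); the helper name `combined_se` is hypothetical.

```python
import math

# (model, SE(A), SE_x(A), SE_pred(A)) copied from the results table above.
rows = [
    ("qwen3-32b", 0.99, 0.81, 0.56),
    ("llama-3.1-70B-instruct", 1.2, 1.2, 0.0),
    ("qwen2-7b-instruct", 1.7, 1.2, 1.1),
    ("qwen2.5-coder-0.5b-instruct", 1.5, 0.39, 1.4),
]

def combined_se(se_x, se_pred):
    """Recombine two components in quadrature (assumed relationship)."""
    return math.hypot(se_x, se_pred)

# Gap between the reported SE(A) and the recombined value, per model.
gaps = {
    model: abs(se - combined_se(se_x, se_pred))
    for model, se, se_x, se_pred in rows
}

for model, gap in gaps.items():
    print(f"{model}: |reported - recombined| = {gap:.3f}")
```

The table values are rounded to about two significant figures, so small gaps are expected. Rows with SE_pred(A) equal to 0 also have pass1 equal to pass@count, which fits the same reading: if every sample for a question gives the same result, no within-question variance remains.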