ap_cot: results by model



Figure: SE predicted by accuracy. The typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
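If the plotted standard error is the usual binomial SE of a mean accuracy, the shape of the curve follows directly from the formula: it peaks at 50% accuracy and shrinks toward the extremes. A minimal sketch of that relationship (the question count `n = 700` below is an illustrative placeholder, not a figure taken from this page):

```python
import math

def accuracy_se(acc_pct: float, n: int) -> float:
    """Binomial standard error (in percentage points) of an accuracy
    measured as acc_pct percent over n independent questions."""
    p = acc_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# The SE is largest at 50% accuracy and vanishes at 0% and 100%:
# accuracy_se(50, 700) > accuracy_se(90, 700) > accuracy_se(100, 700)
```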

Figure: CDF of question-level accuracy.
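A question-level accuracy CDF of this kind can be built as a plain empirical CDF over per-question accuracies. A minimal sketch (the accuracy values are illustrative, not data from this page):

```python
def empirical_cdf(values):
    """Return (sorted_values, cdf), where cdf[i] is the fraction of
    values less than or equal to sorted_values[i]."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

# e.g. per-question accuracies averaged across models:
xs, cdf = empirical_cdf([0.2, 0.9, 0.9, 1.0])
# cdf climbs in steps of 1/n from 1/4 to 1 across the sorted accuracies
```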

Results table by model

| model | pass@1 (%) | pass@count (%) | win rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|---|
| qwen3-32b | 92 | 94.2 | 33.7 | 2 | 1 | 0.85 | 0.56 |
| qwen3-14b | 89.7 | 92.8 | 32.1 | 3 | 1.1 | 0.98 | 0.57 |
| google_gemma_3_27b_it | 88.7 | 92.5 | 31.6 | 3 | 1.2 | 0.99 | 0.65 |
| qwen2-72b-instruct | 88.7 | 92 | 31.6 | 2 | 1.2 | 0.97 | 0.67 |
| llama-3.1-70B-instruct | 87.8 | 87.8 | 31.2 | 4 | 1.2 | 1.2 | 0 |
| deepseek_r1_distill_llama_70b | 87.4 | 91.8 | 31 | 2 | 1.2 | 0.96 | 0.79 |
| qwen3-8b | 87.1 | 93 | 30.5 | 3 | 1.3 | 1 | 0.75 |
| qwen2.5-coder-32b-instruct | 85.8 | 90.4 | 29.6 | 2 | 1.3 | 1 | 0.81 |
| google_gemma_3_12b_it | 85 | 89.3 | 29 | 3 | 1.3 | 1.2 | 0.64 |
| google_gemma_2_27b_it | 83.9 | 88.5 | 28.7 | 2 | 1.4 | 1.1 | 0.8 |
| qwen3-4b | 83.4 | 90.3 | 28.8 | 4 | 1.4 | 1.2 | 0.74 |
| deepseek_r1_distill_qwen_32b | 82.8 | 88.6 | 28.6 | 2 | 1.4 | 1.1 | 0.9 |
| deepseek_r1_distill_qwen_14b | 82.4 | 91.8 | 27.9 | 4 | 1.4 | 1.1 | 0.9 |
| qwen1.5-72b-chat | 81.9 | 88.7 | 27.5 | 2 | 1.4 | 1.1 | 0.98 |
| google_gemma_2_9b_it | 80.1 | 87.1 | 26.4 | 3 | 1.5 | 1.2 | 0.84 |
| qwen2.5-coder-14b-instruct | 79.1 | 91 | 26.2 | 3 | 1.5 | 1.1 | 1.1 |
| qwen1.5-32b-chat | 79 | 85.7 | 26 | 2 | 1.5 | 1.2 | 0.96 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 77.4 | 87.1 | 27 | 2 | 1.6 | 1.1 | 1.2 |
| qwen2-math-72b-instruct | 75.1 | 86.8 | 25.2 | 2 | 1.6 | 0.99 | 1.3 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 72.1 | 84.7 | 23 | 3 | 1.7 | 1.3 | 1.1 |
| qwen3-1.7b | 71.4 | 83.8 | 23 | 4 | 1.7 | 1.4 | 0.99 |
| qwen1.5-14b-chat | 69.9 | 80.3 | 21.7 | 3 | 1.7 | 1.4 | 1 |
| llama-3.1-8B-instruct | 69.1 | 69.1 | 21.2 | 7 | 1.7 | 1.7 | 0 |
| qwen2-7b-instruct | 68.6 | 88.6 | 21.5 | 4 | 1.7 | 1.2 | 1.3 |
| google_gemma_3_4b_it | 68.3 | 83.3 | 21.4 | 5 | 1.7 | 1.4 | 1 |
| qwen2.5-coder-7b-instruct | 67 | 87.5 | 21 | 4 | 1.8 | 1.2 | 1.3 |
| mistralai_ministral_8b_instruct_2410 | 66.5 | 85.1 | 20.6 | 3 | 1.8 | 1.1 | 1.4 |
| deepseek_r1_distill_llama_8b | 60.7 | 85.1 | 19 | 4 | 1.8 | 1.1 | 1.5 |
| mistralai_mistral_7b_instruct_v0.3 | 59.9 | 79 | 18.1 | 4 | 1.8 | 1.3 | 1.3 |
| qwen1.5-7b-chat | 59.8 | 76.7 | 17.8 | 3 | 1.8 | 1.4 | 1.2 |
| qwen2-math-7b-instruct | 58.7 | 85 | 18.9 | 4 | 1.8 | 1.2 | 1.4 |
| llama-3.2-3B-instruct | 58.5 | 58.5 | 17.6 | 10 | 1.8 | 1.8 | 0 |
| deepseek_v2_lite_chat | 54.6 | 66.9 | 16.3 | 2 | 1.9 | 1.3 | 1.3 |
| mistralai_mathstral_7b_v0.1 | 52.9 | 81.9 | 15.9 | 4 | 1.9 | 1.1 | 1.5 |
| google_codegemma_1.1_7b_it | 52.3 | 76.2 | 15.5 | 5 | 1.9 | 1.4 | 1.3 |
| mistralai_mistral_7b_instruct_v0.2 | 52.2 | 74 | 15.4 | 4 | 1.9 | 1.3 | 1.3 |
| qwen2.5-coder-3b-instruct | 51.8 | 82.1 | 15.9 | 4 | 1.9 | 1.1 | 1.5 |
| deepseek_r1_distill_qwen_7b | 49.4 | 76.4 | 14.8 | 4 | 1.9 | 1.2 | 1.5 |
| google_gemma_7b_it | 47.9 | 67.7 | 14.4 | 4 | 1.9 | 1.4 | 1.3 |
| qwen3-0.6b | 47.2 | 76.8 | 14.4 | 5 | 1.9 | 1.3 | 1.4 |
| mistralai_mistral_7b_instruct_v0.1 | 41.7 | 72 | 12 | 4 | 1.8 | 1.1 | 1.5 |
| google_gemma_3_1b_it | 34.8 | 58.9 | 10.1 | 4 | 1.8 | 1.2 | 1.3 |
| google_gemma_2b_it | 32.2 | 45.4 | 9.9 | 4 | 1.8 | 1.4 | 0.99 |
| qwen2-1.5b-instruct | 31.2 | 62.6 | 9.15 | 4 | 1.7 | 0.99 | 1.4 |
| qwen2.5-coder-1.5b-instruct | 31.1 | 68.8 | 9.98 | 4 | 1.7 | 0.69 | 1.6 |
| llama-3.2-1B-instruct | 29.3 | 29.3 | 8.74 | 13 | 1.7 | 1.7 | 0 |
| qwen1.5-1.8b-chat | 24 | 46.6 | 7.02 | 3 | 1.6 | 0.81 | 1.4 |
| deepseek_r1_distill_qwen_1.5b | 21.8 | 50.4 | 6.34 | 4 | 1.5 | 0.73 | 1.4 |
| qwen2-math-1.5b-instruct | 21.8 | 52.3 | 7.79 | 4 | 1.5 | 0.71 | 1.4 |
| qwen2-0.5b-instruct | 18.5 | 56 | 6.14 | 5 | 1.5 | 0.49 | 1.4 |
| qwen2.5-coder-0.5b-instruct | 18.1 | 58.6 | 6.47 | 5 | 1.4 | 0.35 | 1.4 |
| qwen1.5-0.5b-chat | 15.1 | 48.5 | 5.34 | 5 | 1.3 | 0.5 | 1.2 |
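If pass@count is estimated per question from `count` generations, the standard unbiased pass@k estimator is the usual way to compute it. A sketch under that assumption (the page does not state which estimator was actually used):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn without replacement from n generations, of which
    c are correct, is correct. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # fewer than k incorrect generations: some draw must be correct
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A question with 1 correct generation out of 2 has pass@1 = 0.5;
# averaging this over all questions gives the table's pass@1 column.
```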