aime2025_cot: by models

SE predicted by accuracy

The typical standard errors between pairs of models on this dataset as a function of the absolute accuracy.
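This page does not say how the standard error between a pair of models is computed. As a minimal sketch, assuming both models are scored on the same questions, one common choice is the paired estimate: take the per-question score difference and divide its sample standard deviation by the square root of the number of questions. The function name `paired_se` and the score vectors below are hypothetical and only for illustration.

```python
import math
import statistics


def paired_se(correct_a, correct_b):
    """Standard error of the mean per-question score difference (paired estimate)."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same questions")
    # Per-question difference in score between the two models.
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    # Sample standard deviation of the differences, scaled by sqrt(N).
    return statistics.stdev(diffs) / math.sqrt(len(diffs))


# Hypothetical per-question scores (1 = correct, 0 = incorrect); not real data.
model_a = [1, 1, 0, 0]
model_b = [1, 0, 0, 0]
print(paired_se(model_a, model_b))
```

Because the difference is taken question by question, two models that succeed and fail on the same questions get a smaller standard error than an unpaired comparison would give.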

CDF of question-level accuracy

Results table by model

model pass1 win_rate count SE(A) SE_x(A) SE_pred(A)
deepseek_r1_distill_llama_70b 26.7 22.2 2 8.1 6.6 4.7
deepseek_r1_distill_qwen_14b 23.3 19.1 4 7.7 7.4 2.4
deepseek_r1_distill_qwen_7b 23.3 19.1 4 7.7 7.4 2.4
deepseek_r1_distill_qwen_32b 21.7 17.8 2 7.5 7.1 2.4
qwen3-8b 21.1 17.1 3 7.5 6.1 4.3
google_gemma_3_27b_it 21.1 17.3 3 7.5 6.7 3.3
qwen3-14b 18.9 15.6 3 7.1 5.7 4.3
qwen2.5-coder-32b-instruct 18.3 15.4 2 7.1 5.8 4.1
google_gemma_3_12b_it 17.5 14 4 6.9 6 3.5
qwen3-32b 16.7 13.7 3 6.8 5.3 4.3
deepseek_r1_distill_llama_8b 15.8 12.4 4 6.7 6.5 1.7
qwen3-4b 14.2 11 4 6.4 5.2 3.7
qwen2-math-7b-instruct 13.3 10.7 2 6.2 4 4.7
deepseek_r1_distill_qwen_1.5b 11.7 9.51 4 5.9 5.4 2.4
google_gemma_3_4b_it 10.7 8.3 5 5.6 4.4 3.5
qwen3-1.7b 7.5 6.07 4 4.8 2.7 4
qwen2-math-72b-instruct 5 3.94 2 4 0 4.1
llama-3.2-3B-instruct 3.33 3.3 10 3.3 3.3 0
google_gemma_2_27b_it 3.33 2.63 2 3.3 0 3.3
qwen2.5-coder-14b-instruct 3.33 2.38 3 3.3 3.3 0
qwen2-math-1.5b-instruct 3.33 2.42 3 3.3 3.3 0
qwen2.5-coder-7b-instruct 1.67 1.31 4 2.3 0 2.4
google_codegemma_1.1_7b_it 1.33 1.33 5 2.1 1 1.8
mistralai_mixtral_8x7b_instruct_v0.1 1.11 1.11 3 1.9 0 1.9
mistralai_mixtral_8x22b_instruct_v0.1 1.11 0.98 3 1.9 0 1.9
mistralai_ministral_8b_instruct_2410 0.833 0.821 4 1.7 0 1.7
google_gemma_2_9b_it 0.833 0.821 4 1.7 0 1.7
qwen3-0.6b 0.667 0.586 5 1.5 0 1.5
llama-3.2-1B-instruct 0 0 13 0 0 0
llama-3.1-8B-instruct 0 0 7 0 0 0
llama-3.1-70B-instruct 0 0 4 0 0 0
google_gemma_7b_it 0 0 4 0 0 0
google_gemma_3_1b_it 0 0 4 0 0 0
google_gemma_2b_it 0 0 4 0 0 0
deepseek_v2_lite_chat 0 0 3 0 0 0
mistralai_mathstral_7b_v0.1 0 0 4 0 0 0
qwen2-72b-instruct 0 0 2 0 0 0
qwen2-1.5b-instruct 0 0 4 0 0 0
qwen2-0.5b-instruct 0 0 5 0 0 0
qwen1.5-7b-chat 0 0 3 0 0 0
qwen1.5-72b-chat 0 0 2 0 0 0
qwen1.5-32b-chat 0 0 3 0 0 0
qwen1.5-14b-chat 0 0 3 0 0 0
qwen1.5-1.8b-chat 0 0 3 0 0 0
qwen1.5-0.5b-chat 0 0 5 0 0 0
mistralai_mistral_7b_instruct_v0.3 0 0 4 0 0 0
mistralai_mistral_7b_instruct_v0.1 0 0 4 0 0 0
mistralai_mistral_7b_instruct_v0.2 0 0 4 0 0 0
qwen2.5-coder-0.5b-instruct 0 0 5 0 0 0
qwen2.5-coder-1.5b-instruct 0 0 4 0 0 0
qwen2-7b-instruct 0 0 4 0 0 0
qwen2.5-coder-3b-instruct 0 0 4 0 0 0
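
The columns are not defined on this page, but the numbers can be sanity-checked. The sketch below assumes the dataset has 30 questions (the AIME 2025 I and II problem count) and checks two things on a few rows copied from the table: that SE(A) is close to the binomial standard error sqrt(p(1-p)/N) of pass1, and that SE(A)^2 is close to SE_x(A)^2 + SE_pred(A)^2. Both relations are observations about the published values, not documented definitions, and the helper names `binomial_se` and `combined_se` are hypothetical. Agreement is only expected up to the rounding of the displayed values.

```python
import math

N_QUESTIONS = 30  # assumption: number of questions in the dataset


def binomial_se(acc_percent, n=N_QUESTIONS):
    """Binomial standard error of an accuracy, in percentage points."""
    p = acc_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


def combined_se(se_x, se_pred):
    """Combine two standard-error components in quadrature."""
    return math.hypot(se_x, se_pred)


# (model, pass1, SE(A), SE_x(A), SE_pred(A)) copied from the table above.
rows = [
    ("deepseek_r1_distill_llama_70b", 26.7, 8.1, 6.6, 4.7),
    ("qwen3-8b", 21.1, 7.5, 6.1, 4.3),
    ("llama-3.2-3B-instruct", 3.33, 3.3, 3.3, 0.0),
]

for model, pass1, se, se_x, se_pred in rows:
    print(
        f"{model}: table SE(A)={se}, "
        f"binomial={binomial_se(pass1):.2f}, "
        f"quadrature={combined_se(se_x, se_pred):.2f}"
    )
```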