aime2024_cot: by models



SE predicted by accuracy

This plot shows the typical standard error between pairs of models on this dataset as a function of absolute accuracy.
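As a rough illustration of how a standard error depends on accuracy, the sketch below computes the binomial standard error sqrt(p(1-p)/N) for a single model. It assumes N = 30 questions (AIME 2024 consists of 30 problems); the names `binomial_se` and `se_curve` are hypothetical. This formula matches the SE(A) column of the results table to rounding, but it is a single-model quantity and not necessarily how the pairwise curve on this page is computed.

```python
import math

# Assumption: AIME 2024 has 30 problems (15 each in AIME I and AIME II).
N_QUESTIONS = 30


def binomial_se(accuracy: float, n: int = N_QUESTIONS) -> float:
    """Standard error of a mean accuracy in [0, 1] over n independent questions."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def se_curve(accuracies, n: int = N_QUESTIONS):
    """Return (accuracy, SE in percentage points) pairs for plotting or inspection."""
    return [(p, 100.0 * binomial_se(p, n)) for p in accuracies]


if __name__ == "__main__":
    for p, se in se_curve([0.0, 0.1, 0.25, 0.383, 0.5]):
        print(f"accuracy={100 * p:5.1f}%  SE={se:4.1f} points")
```

The curve is zero at 0% accuracy and largest at 50%, so with only 30 questions even the strongest models listed here carry a standard error of several percentage points.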

CDF of question-level accuracy

Results table by model

| model | pass1 | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---:|---:|---:|---:|---:|---:|
| deepseek_r1_distill_llama_70b | 38.3 | 33.3 | 2 | 8.9 | 5.4 | 7.1 |
| deepseek_r1_distill_qwen_32b | 35 | 30 | 2 | 8.7 | 6.1 | 6.2 |
| google_gemma_3_27b_it | 27.8 | 23.1 | 3 | 8.2 | 5.5 | 6.1 |
| deepseek_r1_distill_qwen_7b | 25.8 | 21.8 | 4 | 8 | 6.1 | 5.2 |
| qwen3-32b | 25.6 | 20.8 | 3 | 8 | 5.8 | 5.4 |
| qwen3-14b | 23.3 | 18.8 | 3 | 7.7 | 6.7 | 3.8 |
| deepseek_r1_distill_qwen_14b | 21.7 | 18 | 4 | 7.5 | 5.5 | 5.1 |
| google_gemma_3_12b_it | 19.2 | 15.1 | 4 | 7.2 | 6.1 | 3.7 |
| qwen3-8b | 17.8 | 14.2 | 3 | 7 | 5.1 | 4.7 |
| qwen3-4b | 17.5 | 13.6 | 4 | 6.9 | 5.9 | 3.7 |
| llama-3.1-70B-instruct | 16.7 | 13.6 | 4 | 6.8 | 6.8 | 0 |
| deepseek_r1_distill_qwen_1.5b | 13.3 | 10.4 | 4 | 6.2 | 4.9 | 3.8 |
| deepseek_r1_distill_llama_8b | 13.3 | 10.4 | 4 | 6.2 | 4.5 | 4.3 |
| qwen2.5-coder-32b-instruct | 11.7 | 9.03 | 2 | 5.9 | 2.6 | 5.3 |
| google_gemma_3_4b_it | 8 | 5.68 | 5 | 5 | 4 | 3 |
| qwen2.5-coder-14b-instruct | 7.78 | 6.31 | 3 | 4.9 | 4.5 | 1.9 |
| llama-3.2-3B-instruct | 6.67 | 4.59 | 10 | 4.6 | 4.6 | 0 |
| llama-3.1-8B-instruct | 6.67 | 6.31 | 7 | 4.6 | 4.6 | 0 |
| qwen2.5-coder-7b-instruct | 4.17 | 2.97 | 4 | 3.6 | 2.2 | 2.9 |
| qwen3-1.7b | 4.17 | 2.87 | 4 | 3.6 | 3.2 | 1.7 |
| qwen2-math-7b-instruct | 3.33 | 2.26 | 2 | 3.3 | 0 | 3.3 |
| qwen2-72b-instruct | 3.33 | 2.68 | 2 | 3.3 | 0 | 3.3 |
| qwen2-math-1.5b-instruct | 3.33 | 2.32 | 3 | 3.3 | 3.3 | 0 |
| llama-3.2-1B-instruct | 3.33 | 3 | 13 | 3.3 | 3.3 | 0 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 2.22 | 2.14 | 3 | 2.7 | 0 | 2.7 |
| google_gemma_2_27b_it | 1.67 | 1.14 | 2 | 2.3 | 0 | 2.4 |
| google_gemma_3_1b_it | 1.67 | 1.53 | 4 | 2.3 | 0 | 2.4 |
| mistralai_mathstral_7b_v0.1 | 1.67 | 1.43 | 4 | 2.3 | 1.3 | 1.9 |
| qwen2.5-coder-3b-instruct | 1.67 | 1.28 | 4 | 2.3 | 0 | 2.4 |
| qwen1.5-14b-chat | 1.11 | 0.757 | 3 | 1.9 | 0 | 1.9 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 1.11 | 0.743 | 3 | 1.9 | 0 | 1.9 |
| google_gemma_2_9b_it | 0.833 | 0.556 | 4 | 1.7 | 0 | 1.7 |
| qwen2.5-coder-1.5b-instruct | 0.833 | 0.71 | 4 | 1.7 | 0 | 1.7 |
| qwen2.5-coder-0.5b-instruct | 0.667 | 0.453 | 5 | 1.5 | 0 | 1.5 |
| google_codegemma_1.1_7b_it | 0.667 | 0.453 | 5 | 1.5 | 0 | 1.5 |
| google_gemma_7b_it | 0 | 0 | 4 | 0 | 0 | 0 |
| google_gemma_2b_it | 0 | 0 | 4 | 0 | 0 | 0 |
| deepseek_v2_lite_chat | 0 | 0 | 3 | 0 | 0 | 0 |
| qwen1.5-7b-chat | 0 | 0 | 3 | 0 | 0 | 0 |
| qwen2-1.5b-instruct | 0 | 0 | 4 | 0 | 0 | 0 |
| qwen2-0.5b-instruct | 0 | 0 | 5 | 0 | 0 | 0 |
| qwen1.5-0.5b-chat | 0 | 0 | 5 | 0 | 0 | 0 |
| mistralai_mistral_7b_instruct_v0.2 | 0 | 0 | 4 | 0 | 0 | 0 |
| mistralai_ministral_8b_instruct_2410 | 0 | 0 | 4 | 0 | 0 | 0 |
| mistralai_mistral_7b_instruct_v0.3 | 0 | 0 | 4 | 0 | 0 | 0 |
| mistralai_mistral_7b_instruct_v0.1 | 0 | 0 | 4 | 0 | 0 | 0 |
| qwen1.5-72b-chat | 0 | 0 | 2 | 0 | 0 | 0 |
| qwen1.5-32b-chat | 0 | 0 | 3 | 0 | 0 | 0 |
| qwen1.5-1.8b-chat | 0 | 0 | 3 | 0 | 0 | 0 |
| qwen2-7b-instruct | 0 | 0 | 4 | 0 | 0 | 0 |
| qwen3-0.6b | 0 | 0 | 5 | 0 | 0 | 0 |
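
The three SE columns in the table appear to satisfy SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2, to within about 0.1 after rounding. That is the relationship a variance decomposition would give, but it is an observation from the printed numbers, not a definition stated on this page. The sketch below checks it on a few rows copied from the table; the helper names `combined_se` and `max_abs_gap` are hypothetical.

```python
import math

# (model, SE(A), SE_x(A), SE_pred(A)) copied from a few rows of the results table.
ROWS = [
    ("deepseek_r1_distill_llama_70b", 8.9, 5.4, 7.1),
    ("deepseek_r1_distill_qwen_32b", 8.7, 6.1, 6.2),
    ("qwen3-14b", 7.7, 6.7, 3.8),
    ("llama-3.1-70B-instruct", 6.8, 6.8, 0.0),
    ("qwen2.5-coder-32b-instruct", 5.9, 2.6, 5.3),
]


def combined_se(se_x: float, se_pred: float) -> float:
    """Combine two components in quadrature: sqrt(se_x^2 + se_pred^2)."""
    return math.hypot(se_x, se_pred)


def max_abs_gap(rows) -> float:
    """Largest absolute difference between reported SE(A) and the combined value."""
    return max(abs(se - combined_se(se_x, se_pred)) for _, se, se_x, se_pred in rows)


if __name__ == "__main__":
    for model, se, se_x, se_pred in ROWS:
        print(f"{model}: reported={se:.1f} combined={combined_se(se_x, se_pred):.2f}")
    print(f"max gap: {max_abs_gap(ROWS):.3f}")
```

Rows where one component is 0 reduce to SE(A) equal to the other component, which matches entries such as llama-3.1-70B-instruct.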