gpqa_cot: by models



SE predicted by accuracy

Figure: typical standard errors between pairs of models on this dataset, shown as a function of absolute accuracy.
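The page does not reproduce the formula behind this curve. As a minimal sketch, the snippet below assumes the standard binomial (CLT) standard error and a question count of 448 (the full GPQA set); both are assumptions, not details stated on this page.

```python
import numpy as np

N_QUESTIONS = 448  # assumed size of the GPQA question set

def se_accuracy(acc: float, n: int = N_QUESTIONS) -> float:
    """Standard error of a single model's accuracy under a binomial model."""
    return np.sqrt(acc * (1.0 - acc) / n)

def se_pair(acc_a: float, acc_b: float, n: int = N_QUESTIONS) -> float:
    """Standard error of the accuracy gap between two models, treating their
    per-question scores as independent (no pairing/clustering)."""
    return np.sqrt(se_accuracy(acc_a, n) ** 2 + se_accuracy(acc_b, n) ** 2)

# A model at 49.4% accuracy gives SE(A) of roughly 2.4 points under these assumptions.
print(f"{100 * se_accuracy(0.494):.1f}")    # ~2.4
print(f"{100 * se_pair(0.494, 0.438):.1f}") # unpaired SE of the gap, ~3.3
```

With these assumptions, a model at 49.4% accuracy has SE(A) of about 2.4 points, which matches the top row of the table below; a paired estimate that accounts for correlation between two models' per-question scores would typically be smaller than se_pair.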

CDF of question-level accuracy
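
The per-question data behind this plot is not included on the page. A minimal sketch of how such a CDF can be computed, assuming a hypothetical array `pass_rates` that holds each question's fraction of correct samples:

```python
import numpy as np

# Assumption: pass_rates[i] is the fraction of sampled responses to question i
# that were correct. Placeholder values only; the real data is not shown here.
pass_rates = np.array([0.0, 0.1, 0.3, 0.3, 0.7, 1.0])

xs = np.sort(pass_rates)
cdf = np.arange(1, len(xs) + 1) / len(xs)  # fraction of questions at or below each accuracy

for x, p in zip(xs, cdf):
    print(f"accuracy <= {x:.2f}: {p:.2f} of questions")
```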

Results table by model

model pass1(%) win_rate(%) count SE(A) SE_x(A) SE_pred(A)
qwen3-32b 49.4 31.9 11 2.4 1.7 1.6
llama-3.1-70B-instruct 43.8 27.6 13 2.3 2.3 0
google_gemma_3_27b_it 43.4 27.5 7 2.3 1.7 1.6
qwen3-14b 41.7 25.7 11 2.3 1.9 1.4
qwen2-math-72b-instruct 40.6 25.4 10 2.3 1.5 1.7
qwen3-8b 40.1 24.8 11 2.3 1.8 1.4
qwen2.5-coder-32b-instruct 39.5 24.3 10 2.3 1.7 1.6
qwen2-72b-instruct 39 23.5 10 2.3 1.8 1.5
google_gemma_3_12b_it 37.4 23.4 12 2.3 1.6 1.6
qwen3-4b 36.7 22.5 12 2.3 1.8 1.3
qwen2.5-coder-14b-instruct 34.6 21.3 11 2.2 1.3 1.8
qwen1.5-32b-chat 32.5 19.7 11 2.2 1.4 1.7
mistralai_mixtral_8x22b_instruct_v0.1 32.2 19.6 11 2.2 1.4 1.7
llama-3.1-8B-instruct 31.7 19.9 15 2.2 2.2 0
qwen3-1.7b 31.6 20.2 12 2.2 1.6 1.5
qwen1.5-72b-chat 31.5 19 10 2.2 1.5 1.6
deepseek_r1_distill_qwen_32b 31.5 19 10 2.2 1.8 1.3
qwen2.5-coder-7b-instruct 30.9 19.5 11 2.2 1.1 1.9
mistralai_mathstral_7b_v0.1 30.8 19 11 2.2 1.2 1.8
mistralai_ministral_8b_instruct_2410 30.8 19.3 12 2.2 1.2 1.9
google_gemma_2_9b_it 30.7 18.1 12 2.2 1.6 1.5
google_gemma_3_4b_it 30.4 19.2 13 2.2 1.5 1.6
mistralai_mixtral_8x7b_instruct_v0.1 29.8 18 11 2.2 1.3 1.7
llama-3.2-3B-instruct 29.7 18.7 17 2.2 2.2 0
deepseek_r1_distill_qwen_14b 28.6 17 11 2.1 1.7 1.3
qwen1.5-14b-chat 28.6 17.6 11 2.1 1.4 1.6
qwen2-math-7b-instruct 28.5 18.3 12 2.1 1.1 1.9
google_codegemma_1.1_7b_it 28.1 18.2 13 2.1 1.2 1.7
qwen2-7b-instruct 27.7 17.1 11 2.1 1.2 1.7
mistralai_mistral_7b_instruct_v0.3 27.4 16.8 11 2.1 1.3 1.7
qwen2.5-coder-3b-instruct 27 17.7 12 2.1 0.75 2
google_gemma_7b_it 26.7 17.1 12 2.1 1.5 1.5
deepseek_r1_distill_llama_70b 26.3 15.3 10 2.1 1.7 1.2
mistralai_mistral_7b_instruct_v0.1 25.2 16.7 11 2.1 0.85 1.9
google_gemma_3_1b_it 24.6 16.7 12 2 1 1.7
deepseek_v2_lite_chat 24.5 15.8 11 2 0.99 1.8
deepseek_r1_distill_qwen_7b 24.4 14.3 11 2 1.5 1.3
mistralai_mistral_7b_instruct_v0.2 24.1 14.8 11 2 1.2 1.7
deepseek_r1_distill_llama_8b 23.8 13.6 12 2 1.5 1.4
qwen3-0.6b 23.6 15.6 13 2 1.2 1.6
qwen2-math-1.5b-instruct 23.1 15.3 11 2 0.79 1.8
qwen1.5-7b-chat 20.4 12.9 12 1.9 0.85 1.7
qwen2.5-coder-1.5b-instruct 19.1 13.1 12 1.9 0.34 1.8
qwen2.5-coder-0.5b-instruct 18 13 13 1.8 0.36 1.8
llama-3.2-1B-instruct 15.4 10.8 12 1.7 1.7 0
qwen2-1.5b-instruct 12.5 8.35 12 1.6 0.36 1.5
google_gemma_2b_it 12.1 8.34 12 1.5 0.72 1.4
qwen2-0.5b-instruct 12 8.48 13 1.5 0.28 1.5
deepseek_r1_distill_qwen_1.5b 10.9 6.08 12 1.5 0.81 1.2
qwen1.5-0.5b-chat 8.6 6.18 13 1.3 0.2 1.3
qwen1.5-1.8b-chat 7.92 5.37 12 1.3 0.29 1.2
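
The page does not define the SE columns. One reading suggested by the names, and consistent with the rows where SE_pred(A) = 0 (there SE(A) equals SE_x(A) exactly), is that SE(A) combines a question-sampling component SE_x(A) and a predicted resampling component SE_pred(A) in quadrature. The sketch below checks a few rows under that assumption; agreement is only approximate because the displayed values are rounded.

```python
import math

# Assumed decomposition (not stated on this page): SE(A)^2 ~= SE_x(A)^2 + SE_pred(A)^2.
# A few rows copied from the table above, in percentage points: (SE(A), SE_x(A), SE_pred(A)).
rows = {
    "qwen3-32b":                 (2.4, 1.7, 1.6),
    "llama-3.1-70B-instruct":    (2.3, 2.3, 0.0),
    "qwen2.5-coder-3b-instruct": (2.1, 0.75, 2.0),
    "qwen1.5-0.5b-chat":         (1.3, 0.2, 1.3),
}

for name, (se, se_x, se_pred) in rows.items():
    combined = math.hypot(se_x, se_pred)  # quadrature sum of the two components
    print(f"{name}: reported SE(A)={se:.1f}, sqrt(SE_x^2 + SE_pred^2)={combined:.2f}")
```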