gpqa_cot: results by model



SE predicted by accuracy

[Figure] Typical standard error between pairs of models on this dataset, as a function of absolute accuracy.
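A plausible model for how the standard error scales with accuracy is the binomial standard error sqrt(A(1 − A)/n). With n = 448 questions (an assumption: the size of GPQA's main split, not stated on this page), this reproduces the SE(A) column of the results table. A minimal sketch:

```python
import math

def predicted_se(accuracy_pct: float, n_questions: int = 448) -> float:
    """Binomial standard error of an accuracy, in percentage points.

    n_questions = 448 is an assumption (GPQA main split size); it is
    consistent with the SE(A) values reported in the table.
    """
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

# Spot-check against two rows of the table:
print(round(predicted_se(50.2), 1))  # 2.4  (qwen3-32b: SE(A) = 2.4)
print(round(predicted_se(1.88), 2))  # 0.64 (qwen1.5-0.5b-chat: SE(A) = 0.64)
```

The fit at both ends of the accuracy range suggests the plotted curve is this binomial SE, but that interpretation is inferred from the numbers, not documented on the page.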

[Figure] CDF of question-level accuracy.

Results table by model

All values are in percentage points. SE(A) is the standard error of pass1; SE_x(A) and SE_pred(A) are its two components, combined in quadrature (SE(A)² ≈ SE_x(A)² + SE_pred(A)²). NaN marks a component that could not be estimated.

model pass1 win_rate count SE(A) SE_x(A) SE_pred(A)
qwen3-32b 50.2 33.6 2 2.4 1.7 1.7
google_gemma_3_27b_it 47.3 31.8 1 2.4 NaN NaN
llama-3.1-70B-instruct 45.3 30.4 4 2.4 2.4 0
qwen3-14b 43.2 27.9 3 2.3 1.8 1.5
qwen2.5-coder-32b-instruct 39.7 25.5 2 2.3 1.6 1.7
qwen3-8b 39 24.8 3 2.3 1.8 1.5
qwen2-72b-instruct 36.5 23 2 2.3 1.6 1.6
qwen3-4b 35.8 22.7 4 2.3 1.8 1.3
qwen2-math-72b-instruct 35.3 22.7 2 2.3 1.1 2
google_gemma_3_12b_it 35 22.6 4 2.3 1.6 1.6
qwen1.5-32b-chat 32.3 20.2 2 2.2 1.4 1.7
qwen3-1.7b 32.1 21.1 4 2.2 1.6 1.5
qwen2.5-coder-14b-instruct 31.8 20 3 2.2 1.3 1.8
llama-3.1-8B-instruct 31.7 20.4 7 2.2 2.2 0
mistralai_mixtral_8x22b_instruct_v0.1 31.6 19.9 2 2.2 1.2 1.8
google_gemma_2_9b_it 30.6 18.8 3 2.2 1.6 1.5
mistralai_mixtral_8x7b_instruct_v0.1 30.4 19.5 3 2.2 1.3 1.8
qwen1.5-72b-chat 30.2 19 2 2.2 1.4 1.6
google_gemma_3_4b_it 30 19.6 5 2.2 1.5 1.6
mistralai_mathstral_7b_v0.1 29.7 19.3 4 2.2 0.99 1.9
qwen2-math-7b-instruct 27.9 18.9 4 2.1 0.87 1.9
qwen1.5-14b-chat 27.7 17.8 3 2.1 1.3 1.7
google_codegemma_1.1_7b_it 27.1 18.3 5 2.1 1 1.8
qwen2.5-coder-7b-instruct 26.7 17.5 4 2.1 0.9 1.9
mistralai_ministral_8b_instruct_2410 26.5 17.2 4 2.1 0.97 1.8
mistralai_mistral_7b_instruct_v0.3 26.5 16.8 4 2.1 1.2 1.7
deepseek_r1_distill_qwen_32b 26.3 15.9 2 2.1 1.7 1.3
llama-3.2-3B-instruct 26.1 17.4 10 2.1 2.1 0
google_gemma_7b_it 26.1 17.4 4 2.1 1.4 1.5
qwen2.5-coder-3b-instruct 25.7 17.8 4 2.1 0.53 2
deepseek_v2_lite_chat 24.6 16.5 2 2 0.89 1.8
google_gemma_3_1b_it 24.4 17.1 4 2 0.99 1.8
qwen2-7b-instruct 24.2 15.2 4 2 1.2 1.7
qwen3-0.6b 23.5 16.2 5 2 1.1 1.7
deepseek_r1_distill_qwen_14b 23.2 13.7 4 2 1.6 1.2
qwen1.5-7b-chat 22.9 15.2 3 2 0.82 1.8
mistralai_mistral_7b_instruct_v0.2 21.7 14.1 4 1.9 0.97 1.7
mistralai_mistral_7b_instruct_v0.1 21.3 14.8 4 1.9 0.58 1.8
deepseek_r1_distill_llama_70b 20.4 12 2 1.9 1.6 1.1
deepseek_r1_distill_llama_8b 19.8 11.5 4 1.9 1.4 1.2
deepseek_r1_distill_qwen_7b 19.7 11.5 4 1.9 1.4 1.2
qwen2-math-1.5b-instruct 17.3 12.1 4 1.8 0.54 1.7
qwen2.5-coder-0.5b-instruct 16.2 11.8 5 1.7 0 1.7
qwen2.5-coder-1.5b-instruct 16.2 11.5 4 1.7 0.22 1.7
google_gemma_2b_it 12.3 8.69 4 1.6 0.67 1.4
qwen2-1.5b-instruct 11 7.72 4 1.5 0.38 1.4
llama-3.2-1B-instruct 10.9 7.96 13 1.5 1.5 0
deepseek_r1_distill_qwen_1.5b 9.6 5.59 4 1.4 0.8 1.1
qwen1.5-1.8b-chat 8.04 5.81 3 1.3 0 1.3
qwen2-0.5b-instruct 7.81 5.68 5 1.3 0 1.3
qwen1.5-0.5b-chat 1.88 1.37 5 0.64 0.11 0.63
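The three SE columns appear to satisfy SE(A)² ≈ SE_x(A)² + SE_pred(A)², i.e. the two components combine in quadrature. This is an observation from the table values, not something the page documents; a quick check against a few rows:

```python
import math

def total_se(se_x: float, se_pred: float) -> float:
    """Combine two independent error components in quadrature."""
    return math.sqrt(se_x**2 + se_pred**2)

# (SE_x, SE_pred) pairs taken from rows of the table above:
print(round(total_se(1.7, 1.7), 1))  # 2.4 (qwen3-32b: SE(A) = 2.4)
print(round(total_se(2.4, 0.0), 1))  # 2.4 (llama-3.1-70B-instruct: SE(A) = 2.4)
print(round(total_se(1.6, 1.3), 1))  # 2.1 (deepseek_r1_distill_qwen_32b: SE(A) = 2.1)
```

Under this reading, rows with SE_pred(A) = 0 have all of their uncertainty in the question-level component, and rows with count = 1 report NaN because that component cannot be estimated from a single run.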