gre_physics_cot: by models



[Figure: SE predicted by accuracy. Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.]
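The predicted standard error can be reproduced, to rounding, with the standard binomial formula. A minimal sketch, assuming accuracy is a mean of i.i.d. per-question Bernoulli outcomes and a question count of roughly 75 (inferred from the SE(A) column in the table below, not stated on the page):

```python
import math

def binomial_se(accuracy_pct: float, n_questions: int) -> float:
    """Standard error of a binomial accuracy estimate, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

# qwen2-72b-instruct: 79.3% accuracy -> ~4.7 points, matching SE(A) below.
print(round(binomial_se(79.3, 75), 1))  # 4.7
```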

[Figure: CDF of question-level accuracy.]
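A minimal sketch of how such a CDF can be computed, assuming a hypothetical 0/1 correctness matrix with one row per model and one column per question (the page's actual data layout is not shown):

```python
import numpy as np

# Placeholder correctness matrix: rows = models, columns = questions.
rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=(40, 75))

# Question-level accuracy: fraction of models that answer each question correctly.
q_acc = correct.mean(axis=0)

# Empirical CDF: fraction of questions with accuracy <= x.
xs = np.sort(q_acc)
cdf = np.arange(1, xs.size + 1) / xs.size
print(list(zip(xs[::15].round(2), cdf[::15].round(2))))
```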

Results table by model. pass@1 and win rate are in percent. Across rows, the error columns appear to satisfy SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2, i.e. the overall standard error decomposes into a question-level component and a predicted sampling component. A sketch of how the two headline metrics can be computed follows the table.

| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| qwen2-72b-instruct | 79.3 | 46.4 | 2 | 4.7 | 4.4 | 1.6 |
| google_gemma_3_27b_it | 72 | 42 | 3 | 5.2 | 4.1 | 3.2 |
| qwen2-math-72b-instruct | 71.3 | 41.2 | 2 | 5.2 | 2.9 | 4.3 |
| qwen3-8b | 70.2 | 40.1 | 3 | 5.3 | 4.1 | 3.3 |
| qwen3-4b | 70 | 39.9 | 4 | 5.3 | 3.8 | 3.7 |
| qwen3-14b | 69.3 | 39 | 3 | 5.3 | 4.6 | 2.7 |
| qwen3-32b | 65.8 | 37.1 | 3 | 5.5 | 3.7 | 4 |
| qwen2.5-coder-32b-instruct | 62 | 34.8 | 2 | 5.6 | 4.3 | 3.7 |
| google_gemma_2_27b_it | 58 | 31.9 | 2 | 5.7 | 3.9 | 4.1 |
| google_gemma_3_12b_it | 57.3 | 31.9 | 4 | 5.7 | 4.1 | 3.9 |
| qwen2.5-coder-14b-instruct | 56.9 | 30.7 | 3 | 5.7 | 4.3 | 3.8 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 56 | 30.6 | 3 | 5.7 | 4 | 4.1 |
| qwen1.5-72b-chat | 55.3 | 30.3 | 2 | 5.7 | 4.6 | 3.4 |
| qwen1.5-32b-chat | 55.1 | 29.4 | 3 | 5.7 | 4.8 | 3.1 |
| qwen1.5-14b-chat | 50.7 | 27.2 | 3 | 5.8 | 4.3 | 3.8 |
| qwen2.5-coder-7b-instruct | 48 | 26.4 | 4 | 5.8 | 3.4 | 4.7 |
| qwen2-math-7b-instruct | 47.3 | 25.5 | 2 | 5.8 | 4.3 | 3.9 |
| qwen2-7b-instruct | 46 | 24.6 | 4 | 5.8 | 4.3 | 3.8 |
| qwen3-1.7b | 44.3 | 23 | 4 | 5.7 | 4.4 | 3.7 |
| google_gemma_2_9b_it | 44 | 24.6 | 4 | 5.7 | 3.4 | 4.6 |
| qwen1.5-7b-chat | 43.1 | 23.6 | 3 | 5.7 | 4.2 | 3.9 |
| mistralai_mathstral_7b_v0.1 | 39.3 | 20.7 | 4 | 5.6 | 3.7 | 4.3 |
| llama-3.1-8B-instruct | 38.7 | 20.8 | 7 | 5.6 | 5.6 | 0 |
| deepseek_v2_lite_chat | 37.8 | 21.1 | 3 | 5.6 | 3.4 | 4.4 |
| mistralai_mistral_7b_instruct_v0.3 | 37.7 | 18.9 | 4 | 5.6 | 4.4 | 3.4 |
| qwen3-0.6b | 37.3 | 20.3 | 5 | 5.6 | 3.6 | 4.2 |
| mistralai_ministral_8b_instruct_2410 | 36.7 | 19.8 | 4 | 5.6 | 2.5 | 5 |
| google_gemma_3_4b_it | 36 | 19.6 | 5 | 5.5 | 4.2 | 3.6 |
| llama-3.2-3B-instruct | 32 | 16 | 10 | 5.4 | 5.4 | 0 |
| qwen2.5-coder-3b-instruct | 31.7 | 16.5 | 4 | 5.4 | 2.9 | 4.5 |
| qwen2-math-1.5b-instruct | 29.8 | 16.2 | 3 | 5.3 | 3.3 | 4.1 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 27.6 | 15.3 | 3 | 5.2 | 3 | 4.2 |
| google_codegemma_1.1_7b_it | 26.1 | 14.2 | 5 | 5.1 | 2.8 | 4.2 |
| qwen2.5-coder-1.5b-instruct | 25 | 14.3 | 4 | 5 | 0 | 5 |
| google_gemma_7b_it | 23.7 | 11.9 | 4 | 4.9 | 3.8 | 3.1 |
| mistralai_mistral_7b_instruct_v0.2 | 23.3 | 11.5 | 4 | 4.9 | 3.2 | 3.7 |
| qwen2-1.5b-instruct | 21.7 | 11.7 | 4 | 4.8 | 2 | 4.3 |
| mistralai_mistral_7b_instruct_v0.1 | 20 | 10.5 | 4 | 4.6 | 1.9 | 4.2 |
| llama-3.2-1B-instruct | 18.7 | 10.1 | 13 | 4.5 | 4.5 | 0 |
| google_gemma_2b_it | 17.7 | 10.7 | 4 | 4.4 | 2.9 | 3.3 |
| qwen1.5-0.5b-chat | 17.3 | 11.1 | 5 | 4.4 | 1.2 | 4.2 |
| deepseek_r1_distill_llama_70b | 16.7 | 8.03 | 2 | 4.3 | 3 | 3.1 |
| qwen2.5-coder-0.5b-instruct | 16.3 | 10.6 | 5 | 4.3 | 0.17 | 4.3 |
| qwen2-0.5b-instruct | 16 | 9.53 | 5 | 4.2 | 0.57 | 4.2 |
| google_gemma_3_1b_it | 16 | 10.2 | 4 | 4.2 | 2.7 | 3.3 |
| qwen1.5-1.8b-chat | 13.8 | 7.67 | 3 | 4 | 1 | 3.8 |
| deepseek_r1_distill_qwen_32b | 12.7 | 6.6 | 2 | 3.8 | 1.2 | 3.7 |
| deepseek_r1_distill_qwen_14b | 8.67 | 4.49 | 4 | 3.2 | 0.69 | 3.2 |
| deepseek_r1_distill_qwen_7b | 7.33 | 3.17 | 4 | 3 | 1.4 | 2.7 |
| deepseek_r1_distill_llama_8b | 7 | 3.19 | 4 | 2.9 | 0.91 | 2.8 |
| deepseek_r1_distill_qwen_1.5b | 4.67 | 2.2 | 4 | 2.4 | 0.77 | 2.3 |
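
As referenced above, here is a minimal sketch of how pass@1 and a pairwise win rate can be computed. The page does not spell out its exact definitions, so both function names, the data layout, and the tie-handling below are assumptions for illustration:

```python
import numpy as np

def pass_at_1(correct: np.ndarray) -> float:
    """pass@1 in percent: mean correctness over all (question, attempt) cells.

    correct: hypothetical (n_questions, n_attempts) 0/1 array for one model.
    """
    return 100.0 * correct.mean()

def win_rate(model_acc: np.ndarray, others_acc: np.ndarray) -> float:
    """Assumed pairwise win rate in percent: fraction of (opponent, question)
    comparisons the model wins, counting ties as half a win.

    model_acc:  (n_questions,) per-question accuracy for the model.
    others_acc: (n_models, n_questions) per-question accuracies of opponents.
    """
    wins = (model_acc[None, :] > others_acc).mean()
    ties = (model_acc[None, :] == others_acc).mean()
    return 100.0 * (wins + 0.5 * ties)
```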