gre_physics_cot: by model



[Figure: SE predicted by accuracy. Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.]
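
The SE(A) column in the results table below is consistent with a simple binomial error model for accuracy. A minimal sketch of how a predicted SE can be computed from accuracy alone, assuming SE = 100 * sqrt(p(1-p)/n) with roughly n = 74 questions (a count inferred from the reported SE(A) values, not documented on this page):

```python
import math

def predicted_se(acc_pct: float, n_questions: int = 74) -> float:
    """Predicted standard error of an accuracy estimate, in percentage
    points, under a binomial model: 100 * sqrt(p * (1 - p) / n).

    n_questions = 74 is an assumption inferred from the SE(A) column
    of the results table, not a documented property of the dataset.
    """
    p = acc_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

# Spot check against the table: predicted_se(78.3) ~= 4.8, matching
# the reported SE(A) for qwen2-math-72b-instruct.
```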

[Figure: CDF of question-level accuracy.]
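
For reference, a minimal sketch of how such a CDF could be computed, assuming "question-level accuracy" means each question's pass rate averaged over models; the array layout below is illustrative, not this page's actual pipeline:

```python
import numpy as np

def question_accuracy_cdf(correct: np.ndarray):
    """Empirical CDF of question-level accuracy.

    `correct` is a hypothetical (n_models, n_questions) 0/1 array of
    graded answers; a question's accuracy is its mean over models.
    Returns sorted accuracies and their cumulative fractions.
    """
    q_acc = correct.mean(axis=0)               # accuracy per question
    xs = np.sort(q_acc)                        # x-axis: accuracy
    ys = np.arange(1, xs.size + 1) / xs.size   # y-axis: P(acc <= x)
    return xs, ys
```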

Results table by model

model                                   pass@1 (%)  win rate (%)  count  SE(A)  SE_x(A)  SE_pred(A)
qwen2-math-72b-instruct                       78.3          43.7     11    4.8      3.7         3.0
qwen2-72b-instruct                            77.3          43.2     11    4.8      4.6         1.6
google_gemma_3_27b_it                         74.4          42.2     12    5.0      3.9         3.1
qwen3-8b                                      72.0          39.2     12    5.2      4.3         2.9
qwen3-4b                                      71.9          39.7     12    5.2      4.1         3.2
qwen3-14b                                     71.1          38.7     12    5.2      4.6         2.5
qwen3-32b                                     67.9          37.6     11    5.4      4.1         3.5
qwen2.5-coder-32b-instruct                    62.4          33.2     11    5.6      4.3         3.6
qwen1.5-72b-chat                              60.1          32.3     11    5.7      5.1         2.5
google_gemma_2_27b_it                         59.2          32.3     10    5.7      3.6         4.4
qwen1.5-32b-chat                              58.4          29.9     11    5.7      5.1         2.5
mistralai_mixtral_8x22b_instruct_v0.1         58.1          29.8     11    5.7      4.5         3.5
qwen2.5-coder-14b-instruct                    56.2          28.6     12    5.7      4.7         3.3
qwen2.5-coder-7b-instruct                     55.2          28.5     10    5.7      4.3         3.8
google_gemma_3_12b_it                         54.3          28.6     11    5.8      4.4         3.7
qwen1.5-14b-chat                              52.7          26.5     12    5.8      4.6         3.5
qwen2-math-7b-instruct                        51.8          26.8      6    5.8      4.7         3.3
qwen3-1.7b                                    48.2          24.2     12    5.8      4.3         3.8
qwen1.5-7b-chat                               46.3          24.1     12    5.8      4.8         3.2
qwen2-7b-instruct                             46.1          23.2     11    5.8      4.6         3.5
mistralai_mathstral_7b_v0.1                   45.6          23.3     11    5.8      4.1         4.0
mistralai_mistral_7b_instruct_v0.3            41.0          19.5     11    5.7      4.9         2.8
qwen3-0.6b                                    39.8          19.9     13    5.7      4.5         3.4
deepseek_v2_lite_chat                         39.2          19.9     11    5.6      3.7         4.3
google_gemma_2_9b_it                          38.8          20.0     11    5.6      3.8         4.1
mistralai_ministral_8b_instruct_2410          38.3          18.9     11    5.6      3.3         4.5
qwen2-math-1.5b-instruct                      37.3          19.6      4    5.6      4.0         3.9
llama-3.2-3B-instruct                         37.3          17.5     17    5.6      5.6         0.0
llama-3.1-8B-instruct                         37.3          17.9     15    5.6      5.6         0.0
google_gemma_3_4b_it                          37.0          19.2     13    5.6      4.4         3.4
qwen2.5-coder-3b-instruct                     35.3          17.6     12    5.5      3.2         4.5
mistralai_mixtral_8x7b_instruct_v0.1          29.8          15.2     12    5.3      3.4         4.0
qwen2.5-coder-1.5b-instruct                   25.6          13.8     11    5.0      1.3         4.9
google_gemma_7b_it                            24.4          11.3     13    5.0      4.1         2.8
qwen2-1.5b-instruct                           24.3          12.3     13    5.0      2.9         4.0
mistralai_mistral_7b_instruct_v0.2            24.0          10.8     10    4.9      3.6         3.4
google_codegemma_1.1_7b_it                    23.4          11.4     13    4.9      3.3         3.6
mistralai_mistral_7b_instruct_v0.1            20.0          10.2     11    4.6      2.3         4.0
qwen1.5-0.5b-chat                             19.5          11.4     13    4.6      1.2         4.4
llama-3.2-1B-instruct                         18.7          9.57     12    4.5      4.5         0.0
qwen1.5-1.8b-chat                             18.3          9.79     11    4.5      1.5         4.2
qwen2.5-coder-0.5b-instruct                   18.3          11.1     13    4.5      1.1         4.3
deepseek_r1_distill_llama_70b                 18.2          8.64     11    4.5      2.5         3.7
qwen2-0.5b-instruct                           18.2          10.6     13    4.5      1.2         4.3
google_gemma_3_1b_it                          17.9          11.2     13    4.4      3.2         3.1
google_gemma_2b_it                            17.1          9.38     13    4.4      3.2         2.9
deepseek_r1_distill_qwen_14b                  14.7          6.69     11    4.1      1.9         3.6
deepseek_r1_distill_qwen_32b                  12.8          5.98     11    3.9      1.6         3.5
deepseek_r1_distill_llama_8b                  11.3          4.77     13    3.7      1.8         3.2
deepseek_r1_distill_qwen_7b                   6.79          3.09     11    2.9      0.9         2.8
deepseek_r1_distill_qwen_1.5b                  4.0           1.7     13    2.3     0.49         2.2
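
Across rows, the three SE columns appear to combine in quadrature, SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2; this decomposition is inferred from the numbers themselves rather than stated anywhere on the page. A quick spot check for the top row:

```python
import math

# qwen2-math-72b-instruct: SE_x(A) = 3.7, SE_pred(A) = 3.0
se_x, se_pred = 3.7, 3.0
print(math.hypot(se_x, se_pred))  # 4.76..., vs. reported SE(A) = 4.8
```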