cruxeval_input_cot: results by model



[Figure: SE predicted by accuracy — the typical standard error between pairs of models on this dataset, as a function of absolute accuracy.]
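The plotted curve is presumably related to the binomial standard error sqrt(p(1-p)/n); assuming n = 800 questions (the size of CRUXEval), that formula reproduces the SE(A) column in the table below to rounding. A minimal sketch of the check:

```python
import math

def binomial_se(accuracy_pct: float, n_questions: int = 800) -> float:
    """Standard error of a mean-of-Bernoullis accuracy, in percentage points.

    Treats each question as an independent pass/fail trial;
    n_questions = 800 is an assumption matching the CRUXEval question count.
    """
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

# Spot-check against the SE(A) column in the results table:
print(round(binomial_se(74.2), 1))   # 1.5  (qwen2.5-coder-32b-instruct)
print(round(binomial_se(53.8), 1))   # 1.8  (qwen2-72b-instruct)
print(round(binomial_se(1.95), 2))   # 0.49 (qwen2-0.5b-instruct)
```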

[Figure: CDF of question-level accuracy.]
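The empirical CDF here can be computed by sorting the per-question pass rates. A short sketch; the `question_acc` array below is hypothetical placeholder data, not the actual results:

```python
import numpy as np

# Hypothetical per-question pass rates: fraction of sampled completions
# that pass each of the 800 questions (placeholder data).
rng = np.random.default_rng(0)
question_acc = rng.beta(0.5, 0.5, size=800)

# Empirical CDF: F(x) = fraction of questions with accuracy <= x.
xs = np.sort(question_acc)
cdf = np.arange(1, len(xs) + 1) / len(xs)
```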

Results table by model

pass@1 and win rate are percentages; SE(A), SE_x(A), and SE_pred(A) are standard errors in percentage points.

| Model | pass@1 (%) | Win rate (%) | Count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---:|---:|---:|---:|---:|---:|
| qwen2.5-coder-32b-instruct | 74.2 | 48.8 | 2 | 1.5 | 1.2 | 1 |
| google_gemma_3_27b_it | 68.9 | 44.1 | 3 | 1.6 | 1.3 | 1 |
| qwen2.5-coder-14b-instruct | 62.7 | 39.7 | 3 | 1.7 | 1.1 | 1.3 |
| google_gemma_3_12b_it | 61.3 | 37.7 | 4 | 1.7 | 1.3 | 1.1 |
| qwen2-72b-instruct | 53.8 | 32.2 | 2 | 1.8 | 1.2 | 1.3 |
| llama-3.1-70B-instruct | 53.1 | 32.1 | 4 | 1.8 | 1.8 | 0 |
| google_gemma_2_27b_it | 52.8 | 31.4 | 2 | 1.8 | 1.3 | 1.2 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 48.5 | 28.7 | 3 | 1.8 | 1.2 | 1.3 |
| qwen2.5-coder-7b-instruct | 46.7 | 27.6 | 4 | 1.8 | 1.2 | 1.3 |
| google_gemma_3_4b_it | 43.8 | 25.4 | 5 | 1.8 | 1.2 | 1.2 |
| qwen1.5-72b-chat | 43.6 | 25.2 | 2 | 1.8 | 1.1 | 1.4 |
| google_gemma_2_9b_it | 42.7 | 24.1 | 3 | 1.7 | 1.3 | 1.2 |
| google_codegemma_1.1_7b_it | 40.5 | 23.3 | 5 | 1.7 | 1.1 | 1.3 |
| qwen3-4b | 39 | 22.2 | 4 | 1.7 | 1.3 | 1.1 |
| qwen2.5-coder-3b-instruct | 38.9 | 22.3 | 4 | 1.7 | 1 | 1.4 |
| qwen1.5-32b-chat | 37.5 | 20.6 | 3 | 1.7 | 1.2 | 1.3 |
| qwen3-1.7b | 37.2 | 21 | 4 | 1.7 | 1.2 | 1.2 |
| qwen2-math-72b-instruct | 35.8 | 20.4 | 2 | 1.7 | 0.97 | 1.4 |
| qwen1.5-14b-chat | 35.4 | 19.4 | 3 | 1.7 | 1.2 | 1.2 |
| deepseek_r1_distill_llama_70b | 34.1 | 18.7 | 2 | 1.7 | 1.2 | 1.2 |
| deepseek_r1_distill_qwen_14b | 34.1 | 18.9 | 4 | 1.7 | 1.1 | 1.3 |
| mistralai_ministral_8b_instruct_2410 | 32.2 | 17.2 | 4 | 1.7 | 1.1 | 1.3 |
| mistralai_mathstral_7b_v0.1 | 32.2 | 17.2 | 4 | 1.7 | 1.1 | 1.2 |
| qwen3-32b | 31.3 | 17.1 | 3 | 1.6 | 1.2 | 1.1 |
| qwen2-7b-instruct | 31 | 16.6 | 4 | 1.6 | 1 | 1.3 |
| qwen3-14b | 29.7 | 15.9 | 3 | 1.6 | 1.3 | 0.91 |
| llama-3.1-8B-instruct | 27.6 | 14.9 | 7 | 1.6 | 1.6 | 0 |
| google_gemma_7b_it | 27.6 | 15.9 | 4 | 1.6 | 0.91 | 1.3 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 26.1 | 13.8 | 3 | 1.6 | 0.96 | 1.2 |
| mistralai_mistral_7b_instruct_v0.3 | 25.6 | 13 | 4 | 1.5 | 1 | 1.1 |
| qwen2.5-coder-1.5b-instruct | 24 | 13.1 | 4 | 1.5 | 0.79 | 1.3 |
| deepseek_r1_distill_llama_8b | 23.7 | 12 | 4 | 1.5 | 1 | 1.1 |
| qwen1.5-7b-chat | 22.7 | 11.9 | 3 | 1.5 | 0.96 | 1.1 |
| llama-3.2-3B-instruct | 21.2 | 11.6 | 10 | 1.4 | 1.4 | 0 |
| deepseek_v2_lite_chat | 21.2 | 11 | 3 | 1.4 | 0.82 | 1.2 |
| qwen3-8b | 18.6 | 10 | 3 | 1.4 | 0.87 | 1.1 |
| qwen2.5-coder-0.5b-instruct | 18 | 10.4 | 5 | 1.4 | 0.62 | 1.2 |
| mistralai_mistral_7b_instruct_v0.2 | 17.3 | 8.83 | 4 | 1.3 | 0.7 | 1.1 |
| qwen3-0.6b | 15.4 | 8.46 | 5 | 1.3 | 0.65 | 1.1 |
| google_gemma_2b_it | 15.3 | 9.39 | 4 | 1.3 | 0.67 | 1.1 |
| mistralai_mistral_7b_instruct_v0.1 | 13.6 | 6.84 | 4 | 1.2 | 0.6 | 1.1 |
| deepseek_r1_distill_qwen_7b | 12 | 5.92 | 4 | 1.1 | 0.58 | 0.99 |
| qwen2-math-7b-instruct | 7.44 | 3.89 | 4 | 0.93 | 0.38 | 0.85 |
| qwen2-1.5b-instruct | 6.72 | 3.64 | 4 | 0.89 | 0.28 | 0.84 |
| llama-3.2-1B-instruct | 3.62 | 2.07 | 13 | 0.66 | 0.66 | 0 |
| deepseek_r1_distill_qwen_32b | 3.5 | 1.66 | 2 | 0.65 | 0.18 | 0.62 |
| qwen2-0.5b-instruct | 1.95 | 1.03 | 5 | 0.49 | 0.068 | 0.48 |
| qwen2-math-1.5b-instruct | 1.84 | 0.915 | 4 | 0.48 | 0.094 | 0.47 |
| deepseek_r1_distill_qwen_1.5b | 1.78 | 0.722 | 4 | 0.47 | 0.15 | 0.44 |
| qwen1.5-1.8b-chat | 1.04 | 0.616 | 3 | 0.36 | 0 | 0.36 |
| qwen1.5-0.5b-chat | 0.375 | 0.167 | 5 | 0.22 | 0 | 0.22 |
| google_gemma_3_1b_it | 0.0938 | 0.0484 | 4 | 0.11 | 0 | 0.11 |
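Across every row, the two SE components recombine to the total: SE(A)² ≈ SE_x(A)² + SE_pred(A)². This is consistent with a decomposition of the total standard error into independent question-level and prediction-level components, though that reading of the column names is our assumption. A quick check against a few rows:

```python
import math

# (SE(A), SE_x(A), SE_pred(A)) triples taken from rows of the table above.
rows = [
    ("qwen2.5-coder-14b-instruct", 1.7, 1.1, 1.3),
    ("llama-3.1-70B-instruct",     1.8, 1.8, 0.0),
    ("qwen2-math-72b-instruct",    1.7, 0.97, 1.4),
]
for name, se, se_x, se_pred in rows:
    # Recombine the two components in quadrature and compare to SE(A).
    recombined = math.sqrt(se_x**2 + se_pred**2)
    print(f"{name}: SE(A)={se}, sqrt(SE_x^2 + SE_pred^2)={recombined:.2f}")
```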