cruxeval_input_cot: results by model


SE predicted by accuracy

[Figure: typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.]
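The curve itself is not defined on this page, but the SE(A) column below matches the binomial standard error sqrt(p(1-p)/n) with n = 800 (the size of CRUXEval) to rounding. A minimal sketch of that relationship in Python, plus the paired-difference SE the caption alludes to; the correlation rho between two models' per-question scores is an assumed input, not something reported here:

    import math

    N_QUESTIONS = 800  # number of CRUXEval problems

    def accuracy_se(p: float, n: int = N_QUESTIONS) -> float:
        """Binomial standard error of an accuracy estimate p over n questions."""
        return math.sqrt(p * (1.0 - p) / n)

    def pairwise_se(p1: float, p2: float, rho: float, n: int = N_QUESTIONS) -> float:
        """SE of the accuracy difference between two models scored on the same
        questions; rho is the correlation of their per-question scores."""
        se1, se2 = accuracy_se(p1, n), accuracy_se(p2, n)
        return math.sqrt(se1 ** 2 + se2 ** 2 - 2.0 * rho * se1 * se2)

    # accuracy_se(0.761) ~ 0.0151, i.e. about 1.5 percentage points,
    # matching SE(A) for the top row of the table below.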

CDF of question-level accuracy

[Figure: empirical CDF of question-level accuracy.]
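Question-level accuracy is most naturally read as the fraction of runs (or models) that solve a given question; its empirical CDF shows, for instance, what share of questions are never solved. A minimal sketch, assuming a 0/1 scores matrix of shape (n_questions, n_runs):

    import numpy as np

    def question_level_cdf(scores: np.ndarray):
        """scores: (n_questions, n_runs) 0/1 matrix.
        Returns sorted per-question accuracies and their empirical CDF."""
        q_acc = np.sort(scores.mean(axis=1))
        cdf = np.arange(1, q_acc.size + 1) / q_acc.size  # P(accuracy <= q)
        return q_acc, cdf

    # Usage: share of questions with zero passes across all runs.
    rng = np.random.default_rng(0)
    q_acc, cdf = question_level_cdf(rng.integers(0, 2, size=(800, 10)))
    print(float((q_acc == 0.0).mean()))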

Results table by model

pass1 (pass@1 accuracy) and win_rate are percentages; the three SE columns are in percentage points.

model pass1 win_rate count SE(A) SE_x(A) SE_pred(A)
qwen2.5-coder-32b-instruct 76.1 44.6 10 1.5 1.2 0.97
google_gemma_3_27b_it 69 38.7 9 1.6 1.3 1
qwen2.5-coder-14b-instruct 66.1 37 12 1.7 1.2 1.1
google_gemma_3_12b_it 62.5 33.7 11 1.7 1.3 1.1
llama-3.1-70B-instruct 61.4 32.9 13 1.7 1.7 0
google_gemma_2_27b_it 58 30.6 10 1.7 1.3 1.1
qwen2-72b-instruct 56.9 29.8 10 1.8 1.3 1.2
qwen2-math-72b-instruct 56.6 29.8 10 1.8 1.3 1.2
mistralai_mixtral_8x22b_instruct_v0.1 56.1 29.6 11 1.8 1.2 1.3
qwen2.5-coder-7b-instruct 54.9 28.8 11 1.8 1.3 1.2
google_gemma_2_9b_it 47.9 23.4 12 1.8 1.4 1.1
deepseek_r1_distill_qwen_14b 47 23.8 11 1.8 1.3 1.2
qwen1.5-72b-chat 47 23.3 10 1.8 1.3 1.2
deepseek_r1_distill_llama_70b 46.8 23.6 10 1.8 1.3 1.2
google_codegemma_1.1_7b_it 44.7 22.2 13 1.8 1.2 1.3
mistralai_mathstral_7b_v0.1 44 21.2 11 1.8 1.2 1.2
google_gemma_3_4b_it 43.7 21.6 13 1.8 1.3 1.2
mistralai_ministral_8b_instruct_2410 43.3 20.8 11 1.8 1.2 1.2
qwen2.5-coder-3b-instruct 43.3 21.7 12 1.8 1.1 1.3
llama-3.1-8B-instruct 41.9 20.5 15 1.7 1.7 0.052
qwen1.5-32b-chat 41.4 19.8 11 1.7 1.2 1.2
qwen2-7b-instruct 40.4 19 11 1.7 1.2 1.2
qwen1.5-14b-chat 39.6 18.7 12 1.7 1.2 1.2
qwen3-1.7b 38.6 18.4 12 1.7 1.2 1.2
qwen3-4b 37.9 18.1 12 1.7 1.3 1.1
deepseek_r1_distill_llama_8b 36 16.7 12 1.7 1.2 1.2
mistralai_mixtral_8x7b_instruct_v0.1 34.5 16.1 12 1.7 1.1 1.3
qwen2.5-coder-1.5b-instruct 34.3 16.9 12 1.7 1 1.3
qwen3-14b 32.3 14.9 12 1.7 1.4 0.89
llama-3.2-3B-instruct 32.1 15.4 18 1.7 1.7 0
qwen3-32b 31.9 14.9 11 1.6 1.3 0.98
mistralai_mistral_7b_instruct_v0.3 31.7 14 11 1.6 1.2 1.1
qwen2-math-7b-instruct 31.1 14.8 12 1.6 1.1 1.2
google_gemma_7b_it 29.7 14.6 12 1.6 1.1 1.2
qwen2.5-coder-0.5b-instruct 28.2 14.7 13 1.6 0.91 1.3
qwen1.5-7b-chat 27.2 12 12 1.6 1.1 1.2
deepseek_v2_lite_chat 26.5 12 11 1.6 1 1.2
mistralai_mistral_7b_instruct_v0.1 26.3 12 11 1.6 1 1.2
mistralai_mistral_7b_instruct_v0.2 24.7 10.7 11 1.5 0.98 1.2
qwen3-0.6b 22.7 10.6 13 1.5 0.9 1.2
qwen3-8b 21.2 9.66 12 1.4 0.99 1.1
google_gemma_2b_it 18.7 9.76 13 1.4 0.85 1.1
deepseek_r1_distill_qwen_7b 16.9 7.15 11 1.3 0.74 1.1
qwen2-math-1.5b-instruct 16.9 7.28 12 1.3 0.83 1
qwen2-1.5b-instruct 15.1 7.04 12 1.3 0.58 1.1
llama-3.2-1B-instruct 8.88 4.3 21 1 1 0
deepseek_r1_distill_qwen_1.5b 6.12 2.09 12 0.85 0.42 0.74
qwen2-0.5b-instruct 4.86 2.18 13 0.76 0.27 0.71
qwen1.5-1.8b-chat 3.06 1.42 12 0.61 0.15 0.59
qwen1.5-0.5b-chat 1.73 0.867 13 0.46 0.11 0.45
deepseek_r1_distill_qwen_32b 1.56 0.564 10 0.44 0.13 0.42
google_gemma_3_1b_it 0.0865 0.0354 13 0.1 0.055 0.088
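Up to rounding, the SE columns satisfy SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2: for the top row, sqrt(1.2^2 + 0.97^2) ≈ 1.54 vs. the reported 1.5, and rows with SE_pred(A) = 0 (the llama models) are those whose runs agree on every question. A minimal sketch of that decomposition, under the assumption (not stated on this page) that SE_x(A) is the between-question component and SE_pred(A) the run-to-run prediction component:

    import numpy as np

    def se_components(scores: np.ndarray):
        """scores: (n_questions, n_runs) 0/1 matrix.
        Splits the binomial SE of accuracy into a between-question term (SE_x)
        and a within-question prediction-randomness term (SE_pred), so that
        SE(A)^2 = SE_x^2 + SE_pred^2 holds exactly."""
        n = scores.shape[0]
        x = scores.mean(axis=1)                        # per-question pass rate
        se_x = np.sqrt(x.var() / n)                    # question heterogeneity
        se_pred = np.sqrt((x * (1.0 - x)).mean() / n)  # run-to-run randomness
        se_total = np.sqrt(se_x ** 2 + se_pred ** 2)   # = sqrt(p(1-p)/n)
        return 100 * se_total, 100 * se_x, 100 * se_pred  # percentage points

    # Deterministic runs (identical columns) give SE_pred = 0 and SE(A) = SE_x,
    # matching the llama-3.* rows above.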