cruxeval_output_cot: results by model



SE predicted by accuracy

[Figure] Typical standard errors between pairs of models on this dataset, plotted as a function of absolute accuracy.
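The figure itself is not reproduced here, but the relationship it plots is easy to sketch. Below is a minimal illustration, assuming the standard error of an accuracy estimate follows the usual binomial formula over N questions (CRUXEval has 800 problems; that this page's estimator is exactly binomial is an assumption):

    import math

    N_QUESTIONS = 800  # CRUXEval's problem count; assumed to be the N used here

    def se_accuracy(acc_pct: float, n: int = N_QUESTIONS) -> float:
        """Binomial standard error of an accuracy estimate, in percentage points."""
        p = acc_pct / 100.0
        return 100.0 * math.sqrt(p * (1.0 - p) / n)

    def se_pair(acc_a_pct: float, acc_b_pct: float) -> float:
        """SE of the accuracy difference between two models, treating their
        errors as independent (ignores any question-level correlation)."""
        return math.hypot(se_accuracy(acc_a_pct), se_accuracy(acc_b_pct))

    # pass@1 = 82.4% gives SE ~ 1.35 points, close to the table's SE(A) of 1.3;
    # the SE of the gap between the top two models is then about 2.0 points.
    print(round(se_accuracy(82.4), 2), round(se_pair(82.4, 78.7), 2))

Under this formula the curve peaks at 50% accuracy, where p(1 - p) is largest, which matches the table below: SE(A) tops out at 1.8 for models in the 44-56% pass@1 range.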

CDF of question-level accuracy

[Figure] Empirical CDF of question-level accuracy on this dataset.
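The figure is again missing; the sketch below shows how such a CDF would be computed. What "question-level accuracy" aggregates over (models, samples, or both) is not stated on this page, so the correctness matrix and its shape are hypothetical:

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical data: correct[m, q] = True if model m solved question q.
    correct = rng.random((52, 800)) < 0.4

    question_acc = correct.mean(axis=0)         # fraction of models solving each question
    xs = np.sort(question_acc)                  # empirical CDF: sorted values...
    cdf = np.arange(1, xs.size + 1) / xs.size   # ...against their rank fraction

    for x, y in zip(xs[::160], cdf[::160]):     # print a few (accuracy, CDF) points
        print(f"P(question accuracy <= {x:.2f}) = {y:.2f}")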

Results table by model

model pass@1(%) win_rate(%) count SE(A) SE_x(A) SE_pred(A)
qwen2.5-coder-32b-instruct 82.4 44.4 10 1.3 1.1 0.72
qwen3-14b 78.7 41.3 12 1.4 1.2 0.87
qwen2.5-coder-14b-instruct 76.5 39.5 12 1.5 1.3 0.79
google_gemma_3_27b_it 76 38.9 9 1.5 1.3 0.77
llama-3.1-70B-instruct 70.2 34.6 13 1.6 1.6 0
google_gemma_3_12b_it 69.7 34.3 11 1.6 1.4 0.85
qwen3-32b 69.5 34.4 11 1.6 1.4 0.89
deepseek_r1_distill_llama_70b 66.5 32.3 10 1.7 1.3 1.1
qwen2-72b-instruct 64.1 30.5 10 1.7 1.3 1.1
google_gemma_2_27b_it 62.7 29.2 10 1.7 1.4 0.97
mistralai_mixtral_8x22b_instruct_v0.1 62.3 28.8 11 1.7 1.4 1
qwen2-math-72b-instruct 59.5 27.3 10 1.7 1.3 1.1
qwen3-4b 57.4 26.1 12 1.7 1.4 1
qwen2.5-coder-7b-instruct 56.2 25.1 11 1.8 1.4 1.1
google_gemma_3_4b_it 54.5 24 13 1.8 1.5 0.96
deepseek_r1_distill_qwen_14b 54.5 26.7 11 1.8 1 1.4
deepseek_r1_distill_llama_8b 52.3 22.8 12 1.8 1.3 1.2
qwen2.5-coder-3b-instruct 50.1 21.3 12 1.8 1.3 1.2
google_gemma_2_9b_it 48.6 19.9 12 1.8 1.5 0.98
qwen1.5-32b-chat 48.3 20.1 11 1.8 1.4 1.1
qwen1.5-72b-chat 46.6 19.6 10 1.8 1.3 1.2
mistralai_mixtral_8x7b_instruct_v0.1 45.7 18.5 11 1.8 1.3 1.1
llama-3.1-8B-instruct 44.2 17.9 15 1.8 1.8 0
mistralai_ministral_8b_instruct_2410 44.2 17.7 11 1.8 1.3 1.2
mistralai_mathstral_7b_v0.1 40.9 15.8 11 1.7 1.3 1.1
qwen2-7b-instruct 39.3 15.1 11 1.7 1.3 1.1
google_codegemma_1.1_7b_it 38.2 14.8 13 1.7 1.3 1.1
qwen1.5-14b-chat 37.8 14.3 12 1.7 1.3 1.1
mistralai_mistral_7b_instruct_v0.3 34.2 12.4 11 1.7 1.3 1.1
qwen3-1.7b 33.9 14.8 12 1.7 0.94 1.4
qwen3-0.6b 31.4 12 13 1.6 1.1 1.2
deepseek_r1_distill_qwen_32b 30.5 13.4 10 1.6 0.83 1.4
llama-3.2-3B-instruct 30.1 10.5 18 1.6 1.6 0
qwen3-8b 29.5 11.3 12 1.6 1.1 1.2
qwen2.5-coder-1.5b-instruct 29 10.7 12 1.6 1 1.2
deepseek_v2_lite_chat 28.1 9.99 11 1.6 1.1 1.1
google_gemma_3_1b_it 27.5 10.3 12 1.6 1.2 1
qwen2-math-7b-instruct 26.3 9.53 12 1.6 1 1.2
qwen1.5-7b-chat 26.2 9.23 12 1.6 1.1 1.1
google_gemma_7b_it 25.5 9.26 12 1.5 1.2 0.95
mistralai_mistral_7b_instruct_v0.1 23.6 8.26 11 1.5 1 1.1
mistralai_mistral_7b_instruct_v0.2 22.7 8.24 11 1.5 0.87 1.2
qwen2-math-1.5b-instruct 21.6 7.64 12 1.5 1 1
qwen2.5-coder-0.5b-instruct 20.5 7.57 13 1.4 0.9 1.1
deepseek_r1_distill_qwen_7b 19.3 7.45 11 1.4 0.71 1.2
deepseek_r1_distill_qwen_1.5b 14.1 4.94 12 1.2 0.59 1.1
google_gemma_2b_it 12.5 4.61 13 1.2 0.73 0.92
qwen2-1.5b-instruct 12.5 4.12 12 1.2 0.64 0.98
llama-3.2-1B-instruct 11.4 3.77 21 1.1 1.1 0
qwen1.5-0.5b-chat 6.23 2.32 13 0.85 0.43 0.74
qwen2-0.5b-instruct 1.84 0.666 13 0.47 0.11 0.46
qwen1.5-1.8b-chat 1.17 0.378 12 0.38 0.1 0.37
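
The page does not define the win_rate column. One plausible reading, given that every value sits well below the corresponding pass@1, is the probability that a model solves a randomly drawn question that a randomly drawn opponent model misses; the sketch below implements that assumed definition on hypothetical correctness data.

    import numpy as np

    def win_rates(correct: np.ndarray) -> np.ndarray:
        """correct[m, q] in {0, 1}: whether model m solved question q.

        For each model, returns the fraction of (opponent, question) pairs
        where the model is correct and the opponent is not -- an assumed
        definition of win rate, not one confirmed by this page.
        """
        n_models = correct.shape[0]
        rates = np.empty(n_models)
        for m in range(n_models):
            opponents = np.delete(correct, m, axis=0)  # all other models
            rates[m] = (correct[m] * (1 - opponents)).mean()
        return rates

    rng = np.random.default_rng(0)
    demo = (rng.random((5, 800)) < rng.random((5, 1))).astype(int)  # toy data
    print(np.round(100 * win_rates(demo), 1))  # in percent, like the table

Under this assumed definition, even a model that is strictly better than every opponent stays well below 100%, since it only "wins" the questions its opponent gets wrong.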