cruxeval_output_cot: results by model



SE predicted by accuracy

Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.
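For a single model, the standard error of an accuracy estimate on a fixed question set is well approximated by the binomial formula. A minimal sketch, assuming CRUXEval's 800 questions and independent per-question outcomes (the exact estimator behind the plot is not specified here, so treat this as an approximation):

```python
import math

N_QUESTIONS = 800  # CRUXEval has 800 problems

def binomial_se(accuracy_pct, n=N_QUESTIONS):
    """Standard error, in percentage points, of an accuracy
    estimated over n independent questions."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# Closely reproduces the SE(A) column in the table below:
print(round(binomial_se(81.5), 1))  # 1.4 (qwen2.5-coder-32b-instruct)
print(round(binomial_se(20.6), 1))  # 1.4 (llama-3.2-3B-instruct)
```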

CDF of question-level accuracy
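As a sketch of what this plot shows: each question gets an accuracy equal to the fraction of runs that solved it, and the CDF is taken over questions. A hypothetical example, assuming per-question pass/fail results are available as a 0/1 matrix (names, shapes, and data below are illustrative, not the actual pipeline):

```python
import numpy as np

# Hypothetical pass/fail matrix: results[i, j] = 1 if run j solved
# question i, else 0. Shapes and values are illustrative only.
rng = np.random.default_rng(0)
results = rng.integers(0, 2, size=(800, 4))

# Question-level accuracy: fraction of runs that solved each question.
q_acc = results.mean(axis=1)

# Empirical CDF over questions: fraction with accuracy <= t.
thresholds = np.linspace(0.0, 1.0, 5)
cdf = (q_acc[:, None] <= thresholds[None, :]).mean(axis=0)
for t, c in zip(thresholds, cdf):
    print(f"P(question accuracy <= {t:.2f}) = {c:.3f}")
```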

Results table by model

Accuracies (pass@1, win rate) are percentages; the SE columns are in percentage points.

| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---:|---:|---:|---:|---:|---:|
| qwen2.5-coder-32b-instruct | 81.5 | 48.5 | 2 | 1.4 | 1.1 | 0.83 |
| qwen2.5-coder-14b-instruct | 76.2 | 43.9 | 3 | 1.5 | 1.2 | 0.92 |
| qwen3-14b | 76.2 | 44 | 3 | 1.5 | 1.2 | 0.92 |
| google_gemma_3_27b_it | 75.9 | 43.6 | 3 | 1.5 | 1.3 | 0.82 |
| google_gemma_3_12b_it | 69.4 | 38.6 | 4 | 1.6 | 1.3 | 0.92 |
| qwen3-32b | 66.9 | 37.1 | 2 | 1.7 | 1.3 | 0.98 |
| llama-3.1-70B-instruct | 65.4 | 35.4 | 4 | 1.7 | 1.7 | 0 |
| qwen2-72b-instruct | 60.4 | 32.1 | 2 | 1.7 | 1.2 | 1.2 |
| google_gemma_2_27b_it | 59.9 | 31.6 | 2 | 1.7 | 1.4 | 1.1 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 57.2 | 29.5 | 2 | 1.7 | 1.3 | 1.1 |
| qwen3-4b | 56.9 | 29.6 | 4 | 1.8 | 1.4 | 1.1 |
| deepseek_r1_distill_llama_70b | 55.4 | 29 | 2 | 1.8 | 1.3 | 1.2 |
| google_gemma_3_4b_it | 54.9 | 28 | 5 | 1.8 | 1.4 | 1 |
| qwen2.5-coder-7b-instruct | 50.9 | 25.4 | 4 | 1.8 | 1.4 | 1.1 |
| google_gemma_2_9b_it | 46.9 | 22.5 | 3 | 1.8 | 1.4 | 1 |
| deepseek_r1_distill_qwen_14b | 45.9 | 24.1 | 4 | 1.8 | 1 | 1.5 |
| qwen1.5-72b-chat | 45.9 | 22.3 | 2 | 1.8 | 1.3 | 1.2 |
| qwen1.5-32b-chat | 44.9 | 21.5 | 2 | 1.8 | 1.3 | 1.2 |
| qwen2.5-coder-3b-instruct | 44.5 | 21.3 | 4 | 1.8 | 1.3 | 1.2 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 41.4 | 19.6 | 3 | 1.7 | 1.2 | 1.2 |
| deepseek_r1_distill_llama_8b | 39.3 | 18.9 | 4 | 1.7 | 1.2 | 1.3 |
| mistralai_ministral_8b_instruct_2410 | 39.2 | 17.8 | 4 | 1.7 | 1.2 | 1.2 |
| google_codegemma_1.1_7b_it | 36.2 | 16.5 | 5 | 1.7 | 1.3 | 1.1 |
| llama-3.1-8B-instruct | 34.6 | 16 | 7 | 1.7 | 1.7 | 0 |
| mistralai_mathstral_7b_v0.1 | 34.4 | 15.4 | 4 | 1.7 | 1.2 | 1.2 |
| qwen1.5-14b-chat | 33.7 | 14.9 | 3 | 1.7 | 1.2 | 1.2 |
| qwen2-math-72b-instruct | 33.2 | 16.2 | 2 | 1.7 | 0.92 | 1.4 |
| qwen2-7b-instruct | 32.4 | 14.4 | 4 | 1.7 | 1.1 | 1.2 |
| qwen3-1.7b | 31.5 | 15.6 | 4 | 1.6 | 0.88 | 1.4 |
| mistralai_mistral_7b_instruct_v0.3 | 31.3 | 13.7 | 4 | 1.6 | 1.2 | 1.2 |
| deepseek_r1_distill_qwen_32b | 27.7 | 13.4 | 2 | 1.6 | 0.78 | 1.4 |
| google_gemma_3_1b_it | 26.8 | 12 | 4 | 1.6 | 1.2 | 1.1 |
| qwen3-8b | 25.9 | 11.6 | 3 | 1.5 | 0.99 | 1.2 |
| qwen3-0.6b | 25 | 11.2 | 5 | 1.5 | 0.92 | 1.2 |
| qwen1.5-7b-chat | 24.8 | 10.6 | 3 | 1.5 | 1 | 1.1 |
| google_gemma_7b_it | 24.7 | 10.8 | 4 | 1.5 | 1.1 | 1 |
| deepseek_v2_lite_chat | 24.4 | 10.5 | 3 | 1.5 | 0.97 | 1.2 |
| llama-3.2-3B-instruct | 20.6 | 8.53 | 10 | 1.4 | 1.4 | 0 |
| qwen2.5-coder-1.5b-instruct | 19.9 | 8.58 | 4 | 1.4 | 0.75 | 1.2 |
| mistralai_mistral_7b_instruct_v0.1 | 17.9 | 7.62 | 4 | 1.4 | 0.78 | 1.1 |
| qwen2.5-coder-0.5b-instruct | 16.8 | 7.28 | 5 | 1.3 | 0.75 | 1.1 |
| mistralai_mistral_7b_instruct_v0.2 | 16.1 | 7.01 | 4 | 1.3 | 0.63 | 1.1 |
| deepseek_r1_distill_qwen_7b | 12.2 | 5.32 | 4 | 1.2 | 0.45 | 1.1 |
| google_gemma_2b_it | 9.59 | 4.29 | 4 | 1 | 0.57 | 0.87 |
| qwen2-math-1.5b-instruct | 8.41 | 3.45 | 4 | 0.98 | 0.39 | 0.9 |
| llama-3.2-1B-instruct | 7.62 | 3.05 | 13 | 0.94 | 0.94 | 0 |
| qwen2-1.5b-instruct | 5.75 | 2.29 | 4 | 0.82 | 0.35 | 0.74 |
| deepseek_r1_distill_qwen_1.5b | 5.59 | 2.26 | 4 | 0.81 | 0.22 | 0.78 |
| qwen2-math-7b-instruct | 5.5 | 2.34 | 4 | 0.81 | 0.21 | 0.78 |
| qwen1.5-0.5b-chat | 1.73 | 0.798 | 5 | 0.46 | 0.13 | 0.44 |
| qwen2-0.5b-instruct | 0.525 | 0.264 | 5 | 0.26 | 0 | 0.26 |
| qwen1.5-1.8b-chat | 0.375 | 0.171 | 3 | 0.22 | 0.071 | 0.2 |
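One relationship that can be read off the table: the two component columns combine in quadrature to the total, SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2. Presumably SE_x(A) is a question-sampling component and SE_pred(A) a prediction-sampling component, though that reading is an inference from the column names, not stated in the source. A quick check against a few rows:

```python
import math

# (SE(A), SE_x(A), SE_pred(A)) triples copied from rows above.
rows = {
    "qwen2.5-coder-32b-instruct":   (1.4, 1.1, 0.83),
    "llama-3.1-70B-instruct":       (1.7, 1.7, 0.0),
    "qwen2-72b-instruct":           (1.7, 1.2, 1.2),
    "deepseek_r1_distill_qwen_32b": (1.6, 0.78, 1.4),
}
for name, (se, se_x, se_pred) in rows.items():
    recombined = math.hypot(se_x, se_pred)  # sqrt(se_x^2 + se_pred^2)
    print(f"{name}: SE(A) = {se}, recombined = {recombined:.2f}")
```

Every recombined value matches the reported SE(A) to within rounding. The rows where SE_pred(A) is exactly 0 (the llama entries) are consistent with deterministic generations, leaving question sampling as the only error source, though that too is an inference from the numbers rather than something the source states.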