cruxeval_input_cot: results by model


SE predicted by accuracy

[Figure: typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.]
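The curve itself is not defined on this page, but the SE(A) column below matches the binomial standard error sqrt(p(1-p)/n) with n = 800 (the size of CRUXEval) to rounding. A minimal sketch of that relationship in Python, plus the paired-difference SE the caption alludes to; the correlation rho between two models' per-question scores is an assumed input, not something reported here:

    import math

    N_QUESTIONS = 800  # number of CRUXEval problems

    def accuracy_se(p: float, n: int = N_QUESTIONS) -> float:
        """Binomial standard error of an accuracy estimate p over n questions."""
        return math.sqrt(p * (1.0 - p) / n)

    def pairwise_se(p1: float, p2: float, rho: float, n: int = N_QUESTIONS) -> float:
        """SE of the accuracy difference between two models scored on the same
        questions; rho is the correlation of their per-question scores."""
        se1, se2 = accuracy_se(p1, n), accuracy_se(p2, n)
        return math.sqrt(se1 ** 2 + se2 ** 2 - 2.0 * rho * se1 * se2)

    # accuracy_se(0.761) ~ 0.0151, i.e. about 1.5 percentage points,
    # matching SE(A) for the top row of the table below.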

CDF of question-level accuracy

[Figure: empirical CDF of question-level accuracy.]
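Question-level accuracy is most naturally read as the fraction of runs (or models) that solve a given question; its empirical CDF shows, for instance, what share of questions are never solved. A minimal sketch, assuming a 0/1 scores matrix of shape (n_questions, n_runs):

    import numpy as np

    def question_level_cdf(scores: np.ndarray):
        """scores: (n_questions, n_runs) 0/1 matrix.
        Returns sorted per-question accuracies and their empirical CDF."""
        q_acc = np.sort(scores.mean(axis=1))
        cdf = np.arange(1, q_acc.size + 1) / q_acc.size  # P(accuracy <= q)
        return q_acc, cdf

    # Usage: share of questions with zero passes across all runs.
    rng = np.random.default_rng(0)
    q_acc, cdf = question_level_cdf(rng.integers(0, 2, size=(800, 10)))
    print(float((q_acc == 0.0).mean()))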

Results table by model

pass1 (pass@1 accuracy) and win_rate are percentages; the three SE columns are in percentage points.

model pass1 win_rate count SE(A) SE_x(A) SE_pred(A)
qwen2.5-coder-32b-instruct 76.1 44.6 10 1.5 1.2 0.97
google_gemma_3_27b_it 69 38.7 9 1.6 1.3 1
qwen2.5-coder-14b-instruct 66.1 37 12 1.7 1.2 1.1
google_gemma_3_12b_it 62.5 33.7 11 1.7 1.3 1.1
llama-3.1-70B-instruct 61.4 32.9 13 1.7 1.7 0
google_gemma_2_27b_it 58 30.6 10 1.7 1.3 1.1
qwen2-72b-instruct 56.9 29.8 10 1.8 1.3 1.2
qwen2-math-72b-instruct 56.6 29.8 10 1.8 1.3 1.2
mistralai_mixtral_8x22b_instruct_v0.1 56.1 29.6 11 1.8 1.2 1.3
qwen2.5-coder-7b-instruct 54.9 28.8 11 1.8 1.3 1.2
google_gemma_2_9b_it 47.9 23.4 12 1.8 1.4 1.1
deepseek_r1_distill_qwen_14b 47 23.8 11 1.8 1.3 1.2
qwen1.5-72b-chat 47 23.3 10 1.8 1.3 1.2
deepseek_r1_distill_llama_70b 46.8 23.6 10 1.8 1.3 1.2
google_codegemma_1.1_7b_it 44.7 22.2 13 1.8 1.2 1.3
mistralai_mathstral_7b_v0.1 44 21.2 11 1.8 1.2 1.2
google_gemma_3_4b_it 43.7 21.6 13 1.8 1.3 1.2
mistralai_ministral_8b_instruct_2410 43.3 20.8 11 1.8 1.2 1.2
qwen2.5-coder-3b-instruct 43.3 21.7 12 1.8 1.1 1.3
llama-3.1-8B-instruct 41.9 20.5 15 1.7 1.7 0.052
qwen1.5-32b-chat 41.4 19.8 11 1.7 1.2 1.2
qwen2-7b-instruct 40.4 19 11 1.7 1.2 1.2
qwen1.5-14b-chat 39.6 18.7 12 1.7 1.2 1.2
qwen3-1.7b 38.6 18.4 12 1.7 1.2 1.2
qwen3-4b 37.9 18.1 12 1.7 1.3 1.1
deepseek_r1_distill_llama_8b 36 16.7 12 1.7 1.2 1.2
mistralai_mixtral_8x7b_instruct_v0.1 34.5 16.1 12 1.7 1.1 1.3
qwen2.5-coder-1.5b-instruct 34.3 16.9 12 1.7 1 1.3
qwen3-14b 32.3 14.9 12 1.7 1.4 0.89
llama-3.2-3B-instruct 32.1 15.4 18 1.7 1.7 0
qwen3-32b 31.9 14.9 11 1.6 1.3 0.98
mistralai_mistral_7b_instruct_v0.3 31.7 14 11 1.6 1.2 1.1
qwen2-math-7b-instruct 31.1 14.8 12 1.6 1.1 1.2
google_gemma_7b_it 29.7 14.6 12 1.6 1.1 1.2
qwen2.5-coder-0.5b-instruct 28.2 14.7 13 1.6 0.91 1.3
qwen1.5-7b-chat 27.2 12 12 1.6 1.1 1.2
deepseek_v2_lite_chat 26.5 12 11 1.6 1 1.2
mistralai_mistral_7b_instruct_v0.1 26.3 12 11 1.6 1 1.2
mistralai_mistral_7b_instruct_v0.2 24.7 10.7 11 1.5 0.98 1.2
qwen3-0.6b 22.7 10.6 13 1.5 0.9 1.2
qwen3-8b 21.2 9.66 12 1.4 0.99 1.1
google_gemma_2b_it 18.7 9.76 13 1.4 0.85 1.1
deepseek_r1_distill_qwen_7b 16.9 7.15 11 1.3 0.74 1.1
qwen2-math-1.5b-instruct 16.9 7.28 12 1.3 0.83 1
qwen2-1.5b-instruct 15.1 7.04 12 1.3 0.58 1.1
llama-3.2-1B-instruct 8.88 4.3 21 1 1 0
deepseek_r1_distill_qwen_1.5b 6.12 2.09 12 0.85 0.42 0.74
qwen2-0.5b-instruct 4.86 2.18 13 0.76 0.27 0.71
qwen1.5-1.8b-chat 3.06 1.42 12 0.61 0.15 0.59
qwen1.5-0.5b-chat 1.73 0.867 13 0.46 0.11 0.45
deepseek_r1_distill_qwen_32b 1.56 0.564 10 0.44 0.13 0.42
google_gemma_3_1b_it 0.0865 0.0354 13 0.1 0.055 0.088
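Up to rounding, the SE columns satisfy SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2: for the top row, sqrt(1.2^2 + 0.97^2) ≈ 1.54 vs. the reported 1.5, and rows with SE_pred(A) = 0 (the llama models) are those whose runs agree on every question. A minimal sketch of that decomposition, under the assumption (not stated on this page) that SE_x(A) is the between-question component and SE_pred(A) the run-to-run prediction component:

    import numpy as np

    def se_components(scores: np.ndarray):
        """scores: (n_questions, n_runs) 0/1 matrix.
        Splits the binomial SE of accuracy into a between-question term (SE_x)
        and a within-question prediction-randomness term (SE_pred), so that
        SE(A)^2 = SE_x^2 + SE_pred^2 holds exactly."""
        n = scores.shape[0]
        x = scores.mean(axis=1)                        # per-question pass rate
        se_x = np.sqrt(x.var() / n)                    # question heterogeneity
        se_pred = np.sqrt((x * (1.0 - x)).mean() / n)  # run-to-run randomness
        se_total = np.sqrt(se_x ** 2 + se_pred ** 2)   # = sqrt(p(1-p)/n)
        return 100 * se_total, 100 * se_x, 100 * se_pred  # percentage points

    # Deterministic runs (identical columns) give SE_pred = 0 and SE(A) = SE_x,
    # matching the llama-3.* rows above.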