cruxeval_input_cot: results by model



[Figure: SE predicted by accuracy — the typical standard error between pairs of models on this dataset, as a function of absolute accuracy.]
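The plotted curve is presumably related to the binomial standard error sqrt(p(1-p)/n); assuming n = 800 questions (the size of CRUXEval), that formula reproduces the SE(A) column in the table below to rounding. A minimal sketch of the check:

```python
import math

def binomial_se(accuracy_pct: float, n_questions: int = 800) -> float:
    """Standard error of a mean-of-Bernoullis accuracy, in percentage points.

    Treats each question as an independent pass/fail trial;
    n_questions = 800 is an assumption matching the CRUXEval question count.
    """
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

# Spot-check against the SE(A) column in the results table:
print(round(binomial_se(74.2), 1))   # 1.5  (qwen2.5-coder-32b-instruct)
print(round(binomial_se(53.8), 1))   # 1.8  (qwen2-72b-instruct)
print(round(binomial_se(1.95), 2))   # 0.49 (qwen2-0.5b-instruct)
```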

[Figure: CDF of question-level accuracy.]
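The empirical CDF here can be computed by sorting the per-question pass rates. A short sketch; the `question_acc` array below is hypothetical placeholder data, not the actual results:

```python
import numpy as np

# Hypothetical per-question pass rates: fraction of sampled completions
# that pass each of the 800 questions (placeholder data).
rng = np.random.default_rng(0)
question_acc = rng.beta(0.5, 0.5, size=800)

# Empirical CDF: F(x) = fraction of questions with accuracy <= x.
xs = np.sort(question_acc)
cdf = np.arange(1, len(xs) + 1) / len(xs)
```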

Results table by model

pass@1 and win rate are percentages; SE(A), SE_x(A), and SE_pred(A) are standard errors in percentage points.

| Model | pass@1 (%) | Win rate (%) | Count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---:|---:|---:|---:|---:|---:|
| qwen2.5-coder-32b-instruct | 74.2 | 48.8 | 2 | 1.5 | 1.2 | 1 |
| google_gemma_3_27b_it | 68.9 | 44.1 | 3 | 1.6 | 1.3 | 1 |
| qwen2.5-coder-14b-instruct | 62.7 | 39.7 | 3 | 1.7 | 1.1 | 1.3 |
| google_gemma_3_12b_it | 61.3 | 37.7 | 4 | 1.7 | 1.3 | 1.1 |
| qwen2-72b-instruct | 53.8 | 32.2 | 2 | 1.8 | 1.2 | 1.3 |
| llama-3.1-70B-instruct | 53.1 | 32.1 | 4 | 1.8 | 1.8 | 0 |
| google_gemma_2_27b_it | 52.8 | 31.4 | 2 | 1.8 | 1.3 | 1.2 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 48.5 | 28.7 | 3 | 1.8 | 1.2 | 1.3 |
| qwen2.5-coder-7b-instruct | 46.7 | 27.6 | 4 | 1.8 | 1.2 | 1.3 |
| google_gemma_3_4b_it | 43.8 | 25.4 | 5 | 1.8 | 1.2 | 1.2 |
| qwen1.5-72b-chat | 43.6 | 25.2 | 2 | 1.8 | 1.1 | 1.4 |
| google_gemma_2_9b_it | 42.7 | 24.1 | 3 | 1.7 | 1.3 | 1.2 |
| google_codegemma_1.1_7b_it | 40.5 | 23.3 | 5 | 1.7 | 1.1 | 1.3 |
| qwen3-4b | 39 | 22.2 | 4 | 1.7 | 1.3 | 1.1 |
| qwen2.5-coder-3b-instruct | 38.9 | 22.3 | 4 | 1.7 | 1 | 1.4 |
| qwen1.5-32b-chat | 37.5 | 20.6 | 3 | 1.7 | 1.2 | 1.3 |
| qwen3-1.7b | 37.2 | 21 | 4 | 1.7 | 1.2 | 1.2 |
| qwen2-math-72b-instruct | 35.8 | 20.4 | 2 | 1.7 | 0.97 | 1.4 |
| qwen1.5-14b-chat | 35.4 | 19.4 | 3 | 1.7 | 1.2 | 1.2 |
| deepseek_r1_distill_llama_70b | 34.1 | 18.7 | 2 | 1.7 | 1.2 | 1.2 |
| deepseek_r1_distill_qwen_14b | 34.1 | 18.9 | 4 | 1.7 | 1.1 | 1.3 |
| mistralai_ministral_8b_instruct_2410 | 32.2 | 17.2 | 4 | 1.7 | 1.1 | 1.3 |
| mistralai_mathstral_7b_v0.1 | 32.2 | 17.2 | 4 | 1.7 | 1.1 | 1.2 |
| qwen3-32b | 31.3 | 17.1 | 3 | 1.6 | 1.2 | 1.1 |
| qwen2-7b-instruct | 31 | 16.6 | 4 | 1.6 | 1 | 1.3 |
| qwen3-14b | 29.7 | 15.9 | 3 | 1.6 | 1.3 | 0.91 |
| llama-3.1-8B-instruct | 27.6 | 14.9 | 7 | 1.6 | 1.6 | 0 |
| google_gemma_7b_it | 27.6 | 15.9 | 4 | 1.6 | 0.91 | 1.3 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 26.1 | 13.8 | 3 | 1.6 | 0.96 | 1.2 |
| mistralai_mistral_7b_instruct_v0.3 | 25.6 | 13 | 4 | 1.5 | 1 | 1.1 |
| qwen2.5-coder-1.5b-instruct | 24 | 13.1 | 4 | 1.5 | 0.79 | 1.3 |
| deepseek_r1_distill_llama_8b | 23.7 | 12 | 4 | 1.5 | 1 | 1.1 |
| qwen1.5-7b-chat | 22.7 | 11.9 | 3 | 1.5 | 0.96 | 1.1 |
| llama-3.2-3B-instruct | 21.2 | 11.6 | 10 | 1.4 | 1.4 | 0 |
| deepseek_v2_lite_chat | 21.2 | 11 | 3 | 1.4 | 0.82 | 1.2 |
| qwen3-8b | 18.6 | 10 | 3 | 1.4 | 0.87 | 1.1 |
| qwen2.5-coder-0.5b-instruct | 18 | 10.4 | 5 | 1.4 | 0.62 | 1.2 |
| mistralai_mistral_7b_instruct_v0.2 | 17.3 | 8.83 | 4 | 1.3 | 0.7 | 1.1 |
| qwen3-0.6b | 15.4 | 8.46 | 5 | 1.3 | 0.65 | 1.1 |
| google_gemma_2b_it | 15.3 | 9.39 | 4 | 1.3 | 0.67 | 1.1 |
| mistralai_mistral_7b_instruct_v0.1 | 13.6 | 6.84 | 4 | 1.2 | 0.6 | 1.1 |
| deepseek_r1_distill_qwen_7b | 12 | 5.92 | 4 | 1.1 | 0.58 | 0.99 |
| qwen2-math-7b-instruct | 7.44 | 3.89 | 4 | 0.93 | 0.38 | 0.85 |
| qwen2-1.5b-instruct | 6.72 | 3.64 | 4 | 0.89 | 0.28 | 0.84 |
| llama-3.2-1B-instruct | 3.62 | 2.07 | 13 | 0.66 | 0.66 | 0 |
| deepseek_r1_distill_qwen_32b | 3.5 | 1.66 | 2 | 0.65 | 0.18 | 0.62 |
| qwen2-0.5b-instruct | 1.95 | 1.03 | 5 | 0.49 | 0.068 | 0.48 |
| qwen2-math-1.5b-instruct | 1.84 | 0.915 | 4 | 0.48 | 0.094 | 0.47 |
| deepseek_r1_distill_qwen_1.5b | 1.78 | 0.722 | 4 | 0.47 | 0.15 | 0.44 |
| qwen1.5-1.8b-chat | 1.04 | 0.616 | 3 | 0.36 | 0 | 0.36 |
| qwen1.5-0.5b-chat | 0.375 | 0.167 | 5 | 0.22 | 0 | 0.22 |
| google_gemma_3_1b_it | 0.0938 | 0.0484 | 4 | 0.11 | 0 | 0.11 |
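Across every row, the two SE components recombine to the total: SE(A)² ≈ SE_x(A)² + SE_pred(A)². This is consistent with a decomposition of the total standard error into independent question-level and prediction-level components, though that reading of the column names is our assumption. A quick check against a few rows:

```python
import math

# (SE(A), SE_x(A), SE_pred(A)) triples taken from rows of the table above.
rows = [
    ("qwen2.5-coder-14b-instruct", 1.7, 1.1, 1.3),
    ("llama-3.1-70B-instruct",     1.8, 1.8, 0.0),
    ("qwen2-math-72b-instruct",    1.7, 0.97, 1.4),
]
for name, se, se_x, se_pred in rows:
    # Recombine the two components in quadrature and compare to SE(A).
    recombined = math.sqrt(se_x**2 + se_pred**2)
    print(f"{name}: SE(A)={se}, sqrt(SE_x^2 + SE_pred^2)={recombined:.2f}")
```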