cruxeval_output_cot: results by model



SE predicted by accuracy

Typical standard errors between pairs of models on this dataset, as a function of absolute accuracy.
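For a single model, the standard error of an accuracy estimate on a fixed question set is well approximated by the binomial formula. A minimal sketch, assuming CRUXEval's 800 questions and independent per-question outcomes (the exact estimator behind the plot is not specified here, so treat this as an approximation):

```python
import math

N_QUESTIONS = 800  # CRUXEval has 800 problems

def binomial_se(accuracy_pct, n=N_QUESTIONS):
    """Standard error, in percentage points, of an accuracy
    estimated over n independent questions."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# Closely reproduces the SE(A) column in the table below:
print(round(binomial_se(81.5), 1))  # 1.4 (qwen2.5-coder-32b-instruct)
print(round(binomial_se(20.6), 1))  # 1.4 (llama-3.2-3B-instruct)
```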

CDF of question-level accuracy
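As a sketch of what this plot shows: each question gets an accuracy equal to the fraction of runs that solved it, and the CDF is taken over questions. A hypothetical example, assuming per-question pass/fail results are available as a 0/1 matrix (names, shapes, and data below are illustrative, not the actual pipeline):

```python
import numpy as np

# Hypothetical pass/fail matrix: results[i, j] = 1 if run j solved
# question i, else 0. Shapes and values are illustrative only.
rng = np.random.default_rng(0)
results = rng.integers(0, 2, size=(800, 4))

# Question-level accuracy: fraction of runs that solved each question.
q_acc = results.mean(axis=1)

# Empirical CDF over questions: fraction with accuracy <= t.
thresholds = np.linspace(0.0, 1.0, 5)
cdf = (q_acc[:, None] <= thresholds[None, :]).mean(axis=0)
for t, c in zip(thresholds, cdf):
    print(f"P(question accuracy <= {t:.2f}) = {c:.3f}")
```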

Results table by model

Accuracies (pass@1, win rate) are percentages; the SE columns are in percentage points.

| model | pass@1 (%) | win rate (%) | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---:|---:|---:|---:|---:|---:|
| qwen2.5-coder-32b-instruct | 81.5 | 48.5 | 2 | 1.4 | 1.1 | 0.83 |
| qwen2.5-coder-14b-instruct | 76.2 | 43.9 | 3 | 1.5 | 1.2 | 0.92 |
| qwen3-14b | 76.2 | 44 | 3 | 1.5 | 1.2 | 0.92 |
| google_gemma_3_27b_it | 75.9 | 43.6 | 3 | 1.5 | 1.3 | 0.82 |
| google_gemma_3_12b_it | 69.4 | 38.6 | 4 | 1.6 | 1.3 | 0.92 |
| qwen3-32b | 66.9 | 37.1 | 2 | 1.7 | 1.3 | 0.98 |
| llama-3.1-70B-instruct | 65.4 | 35.4 | 4 | 1.7 | 1.7 | 0 |
| qwen2-72b-instruct | 60.4 | 32.1 | 2 | 1.7 | 1.2 | 1.2 |
| google_gemma_2_27b_it | 59.9 | 31.6 | 2 | 1.7 | 1.4 | 1.1 |
| mistralai_mixtral_8x22b_instruct_v0.1 | 57.2 | 29.5 | 2 | 1.7 | 1.3 | 1.1 |
| qwen3-4b | 56.9 | 29.6 | 4 | 1.8 | 1.4 | 1.1 |
| deepseek_r1_distill_llama_70b | 55.4 | 29 | 2 | 1.8 | 1.3 | 1.2 |
| google_gemma_3_4b_it | 54.9 | 28 | 5 | 1.8 | 1.4 | 1 |
| qwen2.5-coder-7b-instruct | 50.9 | 25.4 | 4 | 1.8 | 1.4 | 1.1 |
| google_gemma_2_9b_it | 46.9 | 22.5 | 3 | 1.8 | 1.4 | 1 |
| deepseek_r1_distill_qwen_14b | 45.9 | 24.1 | 4 | 1.8 | 1 | 1.5 |
| qwen1.5-72b-chat | 45.9 | 22.3 | 2 | 1.8 | 1.3 | 1.2 |
| qwen1.5-32b-chat | 44.9 | 21.5 | 2 | 1.8 | 1.3 | 1.2 |
| qwen2.5-coder-3b-instruct | 44.5 | 21.3 | 4 | 1.8 | 1.3 | 1.2 |
| mistralai_mixtral_8x7b_instruct_v0.1 | 41.4 | 19.6 | 3 | 1.7 | 1.2 | 1.2 |
| deepseek_r1_distill_llama_8b | 39.3 | 18.9 | 4 | 1.7 | 1.2 | 1.3 |
| mistralai_ministral_8b_instruct_2410 | 39.2 | 17.8 | 4 | 1.7 | 1.2 | 1.2 |
| google_codegemma_1.1_7b_it | 36.2 | 16.5 | 5 | 1.7 | 1.3 | 1.1 |
| llama-3.1-8B-instruct | 34.6 | 16 | 7 | 1.7 | 1.7 | 0 |
| mistralai_mathstral_7b_v0.1 | 34.4 | 15.4 | 4 | 1.7 | 1.2 | 1.2 |
| qwen1.5-14b-chat | 33.7 | 14.9 | 3 | 1.7 | 1.2 | 1.2 |
| qwen2-math-72b-instruct | 33.2 | 16.2 | 2 | 1.7 | 0.92 | 1.4 |
| qwen2-7b-instruct | 32.4 | 14.4 | 4 | 1.7 | 1.1 | 1.2 |
| qwen3-1.7b | 31.5 | 15.6 | 4 | 1.6 | 0.88 | 1.4 |
| mistralai_mistral_7b_instruct_v0.3 | 31.3 | 13.7 | 4 | 1.6 | 1.2 | 1.2 |
| deepseek_r1_distill_qwen_32b | 27.7 | 13.4 | 2 | 1.6 | 0.78 | 1.4 |
| google_gemma_3_1b_it | 26.8 | 12 | 4 | 1.6 | 1.2 | 1.1 |
| qwen3-8b | 25.9 | 11.6 | 3 | 1.5 | 0.99 | 1.2 |
| qwen3-0.6b | 25 | 11.2 | 5 | 1.5 | 0.92 | 1.2 |
| qwen1.5-7b-chat | 24.8 | 10.6 | 3 | 1.5 | 1 | 1.1 |
| google_gemma_7b_it | 24.7 | 10.8 | 4 | 1.5 | 1.1 | 1 |
| deepseek_v2_lite_chat | 24.4 | 10.5 | 3 | 1.5 | 0.97 | 1.2 |
| llama-3.2-3B-instruct | 20.6 | 8.53 | 10 | 1.4 | 1.4 | 0 |
| qwen2.5-coder-1.5b-instruct | 19.9 | 8.58 | 4 | 1.4 | 0.75 | 1.2 |
| mistralai_mistral_7b_instruct_v0.1 | 17.9 | 7.62 | 4 | 1.4 | 0.78 | 1.1 |
| qwen2.5-coder-0.5b-instruct | 16.8 | 7.28 | 5 | 1.3 | 0.75 | 1.1 |
| mistralai_mistral_7b_instruct_v0.2 | 16.1 | 7.01 | 4 | 1.3 | 0.63 | 1.1 |
| deepseek_r1_distill_qwen_7b | 12.2 | 5.32 | 4 | 1.2 | 0.45 | 1.1 |
| google_gemma_2b_it | 9.59 | 4.29 | 4 | 1 | 0.57 | 0.87 |
| qwen2-math-1.5b-instruct | 8.41 | 3.45 | 4 | 0.98 | 0.39 | 0.9 |
| llama-3.2-1B-instruct | 7.62 | 3.05 | 13 | 0.94 | 0.94 | 0 |
| qwen2-1.5b-instruct | 5.75 | 2.29 | 4 | 0.82 | 0.35 | 0.74 |
| deepseek_r1_distill_qwen_1.5b | 5.59 | 2.26 | 4 | 0.81 | 0.22 | 0.78 |
| qwen2-math-7b-instruct | 5.5 | 2.34 | 4 | 0.81 | 0.21 | 0.78 |
| qwen1.5-0.5b-chat | 1.73 | 0.798 | 5 | 0.46 | 0.13 | 0.44 |
| qwen2-0.5b-instruct | 0.525 | 0.264 | 5 | 0.26 | 0 | 0.26 |
| qwen1.5-1.8b-chat | 0.375 | 0.171 | 3 | 0.22 | 0.071 | 0.2 |
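One relationship that can be read off the table: the two component columns combine in quadrature to the total, SE(A)^2 ≈ SE_x(A)^2 + SE_pred(A)^2. Presumably SE_x(A) is a question-sampling component and SE_pred(A) a prediction-sampling component, though that reading is an inference from the column names, not stated in the source. A quick check against a few rows:

```python
import math

# (SE(A), SE_x(A), SE_pred(A)) triples copied from rows above.
rows = {
    "qwen2.5-coder-32b-instruct":   (1.4, 1.1, 0.83),
    "llama-3.1-70B-instruct":       (1.7, 1.7, 0.0),
    "qwen2-72b-instruct":           (1.7, 1.2, 1.2),
    "deepseek_r1_distill_qwen_32b": (1.6, 0.78, 1.4),
}
for name, (se, se_x, se_pred) in rows.items():
    recombined = math.hypot(se_x, se_pred)  # sqrt(se_x^2 + se_pred^2)
    print(f"{name}: SE(A) = {se}, recombined = {recombined:.2f}")
```

Every recombined value matches the reported SE(A) to within rounding. The rows where SE_pred(A) is exactly 0 (the llama entries) are consistent with deterministic generations, leaving question sampling as the only error source, though that too is an inference from the numbers rather than something the source states.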