Safety Depth in Large Language Models: A Markov Chain Perspective

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Safety Alignment, Large Language Models, Markov Chain, Group Theory, Ensemble
Abstract: Large Language Models (LLMs) are increasingly adopted in high-stakes scenarios, yet their safety mechanisms often remain fragile. Simple jailbreak prompts or even benign fine-tuning can bypass internal safeguards, underscoring the need to understand the failure modes of current safety strategies. Recent findings suggest that vulnerabilities emerge when alignment is confined to only the initial output tokens. To address this, we introduce the notion of safety depth: a designated output position at which the model refuses to generate harmful content. While deeper alignment appears promising, identifying the optimal safety depth remains an open and underexplored challenge. We leverage the equivalence between autoregressive language models and Markov chains to derive the first theoretical characterization of the optimal safety depth. To reach this depth effectively, we propose a cyclic-group augmentation strategy that improves safety scores across six LLMs. In addition, we uncover a critical interaction between safety depth and ensemble width, demonstrating that larger ensembles can offset shallower alignment. These results suggest that test-time computation, often overlooked in safety alignment, can play a key role. Our approach provides actionable insights for building safer LLMs.
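
As rough intuition for the depth–width interaction mentioned in the abstract, the toy calculation below is a minimal sketch, not the paper's construction: it assumes a hypothetical independent per-token refusal probability `p_step` within the aligned prefix and an "any member refuses" aggregation rule across the ensemble.

```python
# Toy sketch (illustrative only): refusal modeled as absorption within the
# aligned prefix of a two-state chain. Each of the first `depth` output
# positions refuses with probability `p_step`; past the prefix, no refusal.

def refusal_prob(depth: int, p_step: float) -> float:
    """Chance a single model refuses somewhere within its aligned prefix of length `depth`."""
    return 1.0 - (1.0 - p_step) ** depth

def ensemble_refusal_prob(depth: int, p_step: float, width: int) -> float:
    """Chance that at least one of `width` independent ensemble members refuses by `depth`."""
    return 1.0 - (1.0 - refusal_prob(depth, p_step)) ** width

if __name__ == "__main__":
    p = 0.15  # hypothetical per-token refusal probability inside the aligned prefix
    widths = (1, 4, 8)
    print(f"{'depth':>5}" + "".join(f"{'w=' + str(w):>10}" for w in widths))
    for d in (1, 2, 4, 8):
        cells = "".join(f"{ensemble_refusal_prob(d, p, w):>10.3f}" for w in widths)
        print(f"{d:>5}" + cells)
```

Under these simplified assumptions, one model aligned to depth 8 and an ensemble of eight models each aligned to depth 1 give the same overall refusal probability, which is the qualitative trade-off (larger ensembles offsetting shallower alignment) that the abstract describes.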
Supplementary Material: zip
Primary Area: Social and economic aspects of machine learning (e.g., fairness, interpretability, human-AI interaction, privacy, safety, strategic behavior)
Submission Number: 5493