On Almost Surely Safe Alignment of Large Language Models at Inference-Time

27 Oct 2025 (modified: 12 May 2026) · Decision pending for TMLR · CC BY 4.0
Abstract: We introduce a novel inference-time alignment approach for LLMs that aims to generate safe responses almost surely, i.e., with probability approaching one w.r.t. a given cost model. Our approach models the generation of safe responses as a constrained Markov Decision Process (MDP) within the LLM's latent space. We augment the latent state with a safety state that tracks the evolution of the safety constraints and dynamically penalize unsafe generations. Consequently, we establish formal safety guarantees w.r.t. the given cost model upon solving the augmented MDP in the latent space with sufficiently large penalties. Building on this foundation, we propose $\texttt{InferenceGuard}$, a practical implementation that safely aligns LLMs without modifying the model weights. Empirically, we demonstrate that $\texttt{InferenceGuard}$ effectively balances safety and task performance, outperforming existing inference-time alignment methods in generating safe and aligned responses. Our findings contribute to the advancement of safer LLM deployment through inference-time alignment, presenting a promising alternative to resource-intensive, overfitting-prone alignment techniques such as RLHF.
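To make the abstract's core construction concrete, here is a minimal sketch of the safety-state augmentation and penalized objective it describes. This is not the authors' $\texttt{InferenceGuard}$ implementation; all names (`cost_model`, `AugmentedState`, `PENALTY`, `BUDGET`) are hypothetical illustrations of the idea under the assumption of a bounded cumulative cost model.

```python
# Hypothetical sketch of the safety-state-augmented MDP with penalized
# rewards described in the abstract. Names and values are illustrative,
# not the authors' InferenceGuard code.
from dataclasses import dataclass

PENALTY = 1e3   # sufficiently large penalty weight for budget violations
BUDGET = 1.0    # total allowed cost w.r.t. the given cost model


@dataclass(frozen=True)
class AugmentedState:
    tokens: tuple            # partial response generated so far
    remaining_budget: float  # safety state: budget left under the cost model


def step(state: AugmentedState, token: int, cost_model) -> AugmentedState:
    """Transition of the augmented MDP: append a token and refresh the
    safety state from the cumulative cost of the new partial response."""
    new_tokens = state.tokens + (token,)
    cost = cost_model(new_tokens)  # cumulative cost of the partial response
    return AugmentedState(new_tokens, BUDGET - cost)


def penalized_score(state: AugmentedState, reward: float) -> float:
    """Reward shaped with a large penalty whenever the safety budget is
    violated; with a sufficiently large penalty, optimizing this objective
    in the augmented MDP favors responses that stay safe almost surely."""
    violation = max(0.0, -state.remaining_budget)
    return reward - PENALTY * violation
```

In this sketch, any inference-time search (e.g., best-of-N or beam search over latent continuations) would rank candidates by `penalized_score` instead of the raw reward, so trajectories that exhaust the safety budget are dominated by safe ones.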
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We revised the paper in response to the reviewers’ feedback and highlighted the modifications in blue. In the introduction and related work, we clarified how our method is positioned relative to training-time alignment and made explicit that our guarantee is with respect to the safety cost model. In the method section, we expanded the discussion of Lagrangian methods to clarify that our concern is specific to the non-convex, inference-time LLM setting, and we refined the theorem statements, including clarifying the role of the injectivity assumption and moving the proof from the appendix to the main text. In the evaluation section, we clarified the baseline design of augmented BoN and beam search, highlighted Beaver-v3 as the Safe-RLHF training-time alignment baseline, and expanded the discussion of when InferenceGuard is particularly beneficial. We provided additional experiments in the appendix, including larger-model validation on Llama-2-70B-Chat, adversarial prompt evaluations on AdvBench, MaliciousInstruct, and HarmBench, practical hyperparameter guidance, and an analysis of robustness across different cost models. We also strengthened the limitations section with a more explicit discussion of the bounded-cost requirement and the reliance on the quality of the safety cost model.
Assigned Action Editor: ~Lingpeng_Kong1
Submission Number: 6322