BarrierSteer: LLM Safety via Learning Barrier Steering
Keywords: LLM Safety, Inference-time Safety, Activation Steering, Control Barrier Functions
TL;DR: This paper introduces BarrierSteer, a framework that improves safety in large language models by steering their latent representations with control barrier functions during inference, reducing adversarial attack success rates and unsafe generations.
Abstract: Despite the strong performance of large language models (LLMs) across diverse tasks, their susceptibility to adversarial attacks and unsafe content generation remains a significant barrier to deployment, particularly in high-stakes settings. Addressing this challenge requires safety mechanisms that are both practically effective and theoretically grounded. In this paper, we introduce BarrierSteer, a novel framework that improves response safety by embedding learned nonlinear safety constraints directly into the model's latent representation space. BarrierSteer treats hidden-state safety classifiers as Control Barrier Functions (CBFs), enabling constraint-guided steering of unsafe latent trajectories during generation. BarrierSteer composes multiple safety constraints through efficient constraint merging and leaves the underlying LLM parameters unmodified, which preserves model utility and performance. We provide theoretical results showing that applying CBFs in latent space yields a principled and computationally efficient approach for steering with respect to learned safety constraints, with guarantees conditional on the learned barriers capturing the intended safety property. Extensive experiments across multiple models and datasets demonstrate that BarrierSteer substantially reduces adversarial attack success rates and unsafe generations, outperforming existing methods.
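The steering idea in the abstract can be illustrated with a deliberately simplified sketch. It assumes a single linear barrier h(x) = w·x + b over a hidden state x, with h(x) >= 0 meaning "safe"; the paper itself describes learned nonlinear constraints and constraint merging, which this sketch does not attempt to reproduce. The names `barrier`, `steer`, `margin`, and all numeric values are hypothetical and chosen only for illustration. Under the linear assumption, the minimal-norm correction that restores h(x) >= margin has a closed form.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def barrier(x: Vector, w: Vector, b: float) -> float:
    """Hypothetical linear safety score: h(x) >= 0 is treated as 'safe'."""
    return dot(w, x) + b


def steer(x: Vector, w: Vector, b: float, margin: float = 0.0) -> Vector:
    """Return the closest point to x (in L2 norm) with h(x) >= margin.

    For a linear barrier this is the closed-form solution of
        min ||u||^2  subject to  w.(x + u) + b >= margin.
    States that already satisfy the constraint are returned unchanged.
    """
    h = barrier(x, w, b)
    if h >= margin:
        return list(x)
    scale = (margin - h) / dot(w, w)  # step length along w
    return [xi + scale * wi for xi, wi in zip(x, w)]


# Illustrative values only (not taken from the paper).
w = [1.0, -2.0, 0.5]
b = -0.5
x_unsafe = [0.0, 1.0, 0.0]  # h = -2.5, violates the constraint
x_safe = [3.0, 0.0, 0.0]    # h = 2.5, already satisfies it

steered = steer(x_unsafe, w, b, margin=0.1)
untouched = steer(x_safe, w, b, margin=0.1)
```

In this sketch the correction moves the state only along w and only as far as needed to reach the margin, so states that already satisfy the constraint pass through unmodified. That mirrors, in a toy setting, the abstract's point that steering acts on latent representations without changing model parameters.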
Track: Regular Paper (9 pages)
Submission Number: 221