Keywords: Safety Alignment, Representation Steering, Context-adaptive
Abstract: Large language models (LLMs) face significant safety risks in their generated outputs when deployed, and representation steering has emerged as a lightweight alternative to resource-intensive, training-based safety alignment methods. However, existing representation steering approaches compute a single, unified steering direction, which fails to leverage the context-specific information that is critical for precise safety alignment. To address this limitation, we propose \textit{CA-Steer}, a context-adaptive representation steering method for LLM safety alignment. CA-Steer computes a context-adaptive direction by retrieving contextually similar safe and unsafe representations as references. In addition, it introduces a sample-level steering gate that filters out unnecessary steering operations, so that safety alignment does not compromise LLM utility. Evaluations on three safety benchmarks and two utility benchmarks show that CA-Steer significantly outperforms existing baselines: it improves the vanilla LLM’s average safety score from 85.80\% to 97.09\% (surpassing the best baseline by 6.28 percentage points) while incurring almost no utility loss. In-depth analyses further support its design choices and show that its overhead is acceptable.
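The two components named in the abstract (a retrieval-based, context-adaptive direction and a sample-level gate) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measure, the number of retrieved references, how the direction is formed, or the gate criterion. The sketch assumes cosine similarity, top-`k` retrieval, a difference of mean references, and a gate that compares mean similarity to unsafe versus safe references. All function names and the parameters `k`, `alpha`, and `gate_threshold` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def top_k(query, bank, k):
    """Return the k vectors in `bank` most similar to `query` (assumed: cosine)."""
    return sorted(bank, key=lambda v: cosine(query, v), reverse=True)[:k]


def context_adaptive_direction(h, safe_bank, unsafe_bank, k=2):
    """Assumed form: mean of retrieved safe refs minus mean of retrieved unsafe refs."""
    safe_ref = mean_vector(top_k(h, safe_bank, k))
    unsafe_ref = mean_vector(top_k(h, unsafe_bank, k))
    return [s - u for s, u in zip(safe_ref, unsafe_ref)]


def steer(h, safe_bank, unsafe_bank, k=2, alpha=1.0, gate_threshold=0.0):
    """Steer hidden state `h` only if the (hypothetical) gate fires.

    Gate assumption: steer when h is, on average, more similar to its
    retrieved unsafe references than to its retrieved safe references.
    """
    safe_sim = sum(cosine(h, v) for v in top_k(h, safe_bank, k)) / k
    unsafe_sim = sum(cosine(h, v) for v in top_k(h, unsafe_bank, k)) / k
    if unsafe_sim - safe_sim <= gate_threshold:
        return list(h)  # gate closed: leave the representation untouched
    direction = context_adaptive_direction(h, safe_bank, unsafe_bank, k)
    return [x + alpha * d for x, d in zip(h, direction)]


# Toy 2-D reference banks, purely illustrative.
safe_bank = [[1.0, 0.0], [0.9, 0.1]]
unsafe_bank = [[0.0, 1.0], [0.1, 0.9]]

steered = steer([0.2, 1.0], safe_bank, unsafe_bank)    # near unsafe refs: steered
untouched = steer([1.0, 0.1], safe_bank, unsafe_bank)  # near safe refs: gated out
```

In this toy setup, a representation that lies close to the unsafe references is shifted toward the safe ones, while a representation that already lies close to the safe references passes through unchanged, which mirrors the stated goal of avoiding unnecessary steering that could hurt utility.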
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 11142