Abstract: Recent studies on the safety alignment of large language models (LLMs) have revealed that existing approaches often operate superficially, leaving models vulnerable to various adversarial attacks. Despite their significance, these studies generally fail to offer actionable solutions beyond data augmentation for achieving more robust safety mechanisms. This paper identifies a fundamental cause of this superficiality: existing alignment approaches often presume that models can implicitly learn a safety-related reasoning task during the alignment process, enabling them to refuse harmful requests. However, the learned safety signals are often diluted by other competing objectives, leading models to struggle to draw a firm safety-conscious decision boundary when confronted with adversarial attacks. Building on this observation, we explicitly introduce a safety-related binary classification task and integrate its signal into our attention and decoding strategies, eliminating this ambiguity and allowing models to respond more responsibly to malicious queries. Notably, with less than 0.2x overhead, our approach enables LLMs to assess the safety of both the query and the previously generated tokens at each necessary generation step. Extensive experiments demonstrate that our method significantly improves the resilience of LLMs against various adversarial attacks, offering a promising pathway toward more robust generative AI systems.
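As a rough illustration only, the minimal PyTorch sketch below shows one way a per-step binary safety classification head could be consulted during decoding to steer generation toward a refusal once the running context is judged unsafe. The toy model, dimensions, refusal-token scheme, and threshold are illustrative assumptions and not the authors' implementation; see the linked code page for the actual method.

```python
# Hedged sketch: a decoder with an auxiliary binary safety head that is queried
# at every generation step. All names, sizes, and the refusal-token override
# below are illustrative assumptions, not the paper's released code.
import torch
import torch.nn as nn


class ToyDecoderWithSafetyHead(nn.Module):
    def __init__(self, vocab_size=100, d_model=32, refusal_token_id=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)  # next-token prediction
        self.safety_head = nn.Linear(d_model, 2)       # binary safe/unsafe classifier
        self.refusal_token_id = refusal_token_id

    def forward(self, input_ids):
        h = self.backbone(self.embed(input_ids))  # (batch, seq_len, d_model)
        last = h[:, -1, :]                        # hidden state at the current step
        return self.lm_head(last), self.safety_head(last)


@torch.no_grad()
def generate(model, input_ids, max_new_tokens=8, unsafe_threshold=0.5):
    """Greedy decoding that checks the safety head at each step and emits a
    refusal token whenever the query-plus-generation context is flagged unsafe."""
    for _ in range(max_new_tokens):
        logits, safety_logits = model(input_ids)
        p_unsafe = torch.softmax(safety_logits, dim=-1)[:, 1]
        next_token = logits.argmax(dim=-1)
        # Override the greedy token with a refusal token when flagged unsafe.
        next_token = torch.where(
            p_unsafe > unsafe_threshold,
            torch.full_like(next_token, model.refusal_token_id),
            next_token,
        )
        input_ids = torch.cat([input_ids, next_token.unsqueeze(-1)], dim=1)
    return input_ids


if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyDecoderWithSafetyHead()       # untrained here; the safety head would
    prompt = torch.randint(2, 100, (1, 5))   # be trained with a binary objective
    print(generate(model, prompt))
```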
Lay Summary: Large language models (LLMs) like ChatGPT are becoming widely used, but they can sometimes respond to harmful or malicious requests, even if they have been trained to be "safe." Current training methods often make the model appear safe on the surface, but these protections can break down when people craft tricky or indirect prompts to bypass them.
Our research introduces a new way to make LLMs more robust by teaching them to recognize unsafe content directly. We add a special signal inside the model that acts like an internal safety monitor, helping it detect and avoid harmful behavior not just at the start of a response, but throughout the entire generation process.
This approach is simple to train, easy to apply after standard safety alignment methods, and adds minimal cost. It could help build AI systems that are safer and more trustworthy, even when users try to trick them.
Link To Code: https://sa-ess.github.io/
Primary Area: Social Aspects->Alignment
Keywords: Large Language Model, LLM, Safety Alignment
Submission Number: 807