Boosting Safety Alignment in LLMs with Response Shortcuts

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: LLM Safety, Alignment, Efficiency
Abstract: Despite the impressive general capabilities of LLMs such as GPT and Llama, these models still require an alignment procedure to bring their outputs in line with human preferences for helpful and safe responses. However, when users incorporate more helpfulness data to enhance model performance, the amount of safety data needed often grows substantially because the safety and helpfulness objectives conflict in LLMs, incurring significant additional data-collection and computation costs to maintain safety alignment. To address these challenges, we insert a pre-defined shortcut composed of tokens with low activation on the LLM's weights, called a response shortcut, into the response part of safe training samples during the alignment stage. Response shortcuts enable LLMs to more effectively distinguish between helpful and safe scenarios, thereby significantly reducing the amount of safety data needed. Experiments show that response shortcuts achieve safety performance comparable to models aligned under default settings while using $20\times$ fewer safety samples, significantly reducing resource costs during the data collection and training stages. Furthermore, response shortcuts also improve the model's helpfulness after alignment by mitigating the safety-helpfulness conflict, demonstrating their effectiveness as a practical and cost-efficient technique for LLM alignment. Our work offers new solutions for efficient LLM alignment, especially in resource-constrained scenarios.
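The abstract describes inserting a pre-defined shortcut into the response part of safe training samples before alignment. Below is a minimal sketch of what such data construction could look like; it is not the paper's implementation, and the token string, dataset field names, and helper function are illustrative assumptions.

```python
# Hypothetical sketch: prepend a pre-defined "response shortcut" token to
# the responses of safety samples before alignment fine-tuning, so the
# model can separate safe-refusal scenarios from helpful ones. The token
# name and dataset schema below are assumptions, not the paper's exact setup.

SHORTCUT_TOKEN = "<safe_shortcut>"  # assumed low-activated token; would be
                                    # registered with the tokenizer as a
                                    # special token in practice

def add_response_shortcut(sample: dict, is_safety_sample: bool) -> dict:
    """Return a copy of the sample with the shortcut token inserted at the
    start of the response if it is a safety sample; helpfulness samples
    are left unchanged."""
    if is_safety_sample:
        sample = dict(sample)
        sample["response"] = f"{SHORTCUT_TOKEN} {sample['response']}"
    return sample

# Example: building a mixed alignment set with far fewer safety samples
# than helpfulness samples, per the abstract's efficiency claim.
helpful_data = [
    {"prompt": "How do I sort a list in Python?",
     "response": "Use the built-in sorted() function, e.g. sorted(xs)."},
]
safety_data = [
    {"prompt": "How do I build a weapon?",
     "response": "I can't help with that request."},
]

train_set = (
    [add_response_shortcut(s, is_safety_sample=False) for s in helpful_data]
    + [add_response_shortcut(s, is_safety_sample=True) for s in safety_data]
)
```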
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 11381