Abstract: The deployment of large language models (LLMs) in real-world applications is hindered by persistent vulnerabilities in safety alignment: existing methods remain susceptible to jailbreak attacks and to alignment collapse after fine-tuning. We observe that this vulnerability stems from two key sources: (1) shallow alignment, in which alignment training primarily adjusts top-layer parameters while neglecting deeper layers, and (2) the scarcity of safety-related key neurons and their high overlap with general key neurons. To address these challenges, we propose RobustAlign, which increases alignment depth and breadth to achieve robust safety alignment through two synergistic innovations: (1) Chain-of-Thought (CoT)-augmented training data, which raises the information entropy of training samples, and (2) Synergistic Gradient Scaling, which promotes deeper and broader parameter adjustments. Extensive experiments on five LLMs against six jailbreak attacks demonstrate RobustAlign’s superiority: it reduces attack success rates (ASR) by 21\%–63\% relative to state-of-the-art baselines, both under jailbreak attacks and after subsequent fine-tuning, while preserving downstream task accuracy and introducing minimal computational overhead (<3\%).
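The abstract describes Synergistic Gradient Scaling only at a high level. Below is a minimal, hypothetical PyTorch sketch of layer-depth-dependent gradient scaling; the function name, the `boost` parameter, the linear schedule, and the assumed Hugging Face-style `model.layers.<idx>` parameter naming are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch only: the paper's actual Synergistic Gradient Scaling rule
# is not specified in this abstract; the schedule below is an assumption.
import re
import torch.nn as nn

_LAYER_RE = re.compile(r"\.layers\.(\d+)\.")  # assumes HF-style "model.layers.<idx>." names


def scale_gradients_by_depth(model: nn.Module, num_layers: int, boost: float = 2.0) -> None:
    """Rescale per-layer gradients so alignment updates are not concentrated in top layers.

    Call after loss.backward() and before optimizer.step().
    """
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        match = _LAYER_RE.search(name)
        if match is None:
            continue  # embeddings, norms, lm_head, etc. are left unscaled
        layer_idx = int(match.group(1))
        # Linear schedule (an assumption): layers farther from the output receive up to
        # `boost`x larger gradients, while the top layer keeps a scale of 1.0.
        depth_frac = 1.0 - layer_idx / max(num_layers - 1, 1)
        param.grad.mul_(1.0 + (boost - 1.0) * depth_frac)
```

In such a setup the hook would sit between the backward pass and the optimizer step of the alignment fine-tuning loop, leaving the loss and optimizer untouched.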
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/unfairness mitigation, safety and alignment
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 2739