Abstract: The deployment of large language models (LLMs) in real-world applications is hindered by persistent vulnerabilities in safety alignment: existing methods remain susceptible to jailbreak attacks and to alignment collapse after fine-tuning. We observe that this vulnerability stems from two key sources: (1) shallow alignment, in which alignment training primarily adjusts top-layer parameters while neglecting deeper layers, and (2) the scarcity of safety-related key neurons and their high overlap with general key neurons. To address these challenges, we propose RobustAlign, which increases alignment depth and breadth to achieve robust safety alignment through two synergistic innovations: (1) Chain-of-Thought (CoT)-augmented training data, which raises the information entropy of training samples, and (2) Synergistic Gradient Scaling, which promotes deeper and broader parameter adjustments. Extensive experiments on five LLMs against six jailbreak attacks demonstrate RobustAlign’s superiority: it reduces attack success rates (ASR) by 21\%–63\% relative to state-of-the-art baselines, both under jailbreak attacks and after subsequent fine-tuning, while preserving downstream task accuracy and introducing minimal computational overhead (<3\%).
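The abstract describes Synergistic Gradient Scaling only at a high level. Below is a minimal, hypothetical PyTorch sketch of layer-depth-dependent gradient scaling; the function name, the `boost` parameter, the linear schedule, and the assumed Hugging Face-style `model.layers.<idx>` parameter naming are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch only: the paper's actual Synergistic Gradient Scaling rule
# is not specified in this abstract; the schedule below is an assumption.
import re
import torch.nn as nn

_LAYER_RE = re.compile(r"\.layers\.(\d+)\.")  # assumes HF-style "model.layers.<idx>." names


def scale_gradients_by_depth(model: nn.Module, num_layers: int, boost: float = 2.0) -> None:
    """Rescale per-layer gradients so alignment updates are not concentrated in top layers.

    Call after loss.backward() and before optimizer.step().
    """
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        match = _LAYER_RE.search(name)
        if match is None:
            continue  # embeddings, norms, lm_head, etc. are left unscaled
        layer_idx = int(match.group(1))
        # Linear schedule (an assumption): layers farther from the output receive up to
        # `boost`x larger gradients, while the top layer keeps a scale of 1.0.
        depth_frac = 1.0 - layer_idx / max(num_layers - 1, 1)
        param.grad.mul_(1.0 + (boost - 1.0) * depth_frac)
```

In such a setup the hook would sit between the backward pass and the optimizer step of the alignment fine-tuning loop, leaving the loss and optimizer untouched.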
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/unfairness mitigation, safety and alignment
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 2739