Correcting with Low Rank, Defending Against All: TurboLoRA for Robust LLM Safety Alignment

ACL ARR 2024 December Submission1801 Authors

16 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · License: CC BY 4.0
Abstract: In recent years, Large Language Models (LLMs) have expanded their applications across various fields but have also faced security challenges. Current alignment methods are effective against only certain jailbreak attacks and fail to defend against others, leaving significant vulnerabilities to diverse and evolving attack strategies. To address these defense blind spots in existing adversarial alignment methods, which can be easily breached by specific jailbreak techniques, we propose TurboLoRA, the first comprehensive adversarial safety alignment method. TurboLoRA intrinsically corrects harmful responses into safe responses by modifying low-rank transformation parameters, which shift the short-range vector disparity between the hidden vectors corresponding to safe and harmful responses. TurboLoRA enables efficient, comprehensive, and adversarially robust safety alignment without affecting downstream tasks. Finally, we verify the effectiveness of our approach through extensive experiments with various jailbreak methods and target LLMs.
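The abstract describes correcting harmful responses by adjusting low-rank transformation parameters that shift hidden vectors toward the safe-response region. Below is a minimal, hypothetical sketch of such a LoRA-style low-rank correction; the class name, rank, scaling, and initialization are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn as nn

class LowRankSafetyShift(nn.Module):
    """Hypothetical sketch: h' = h + (alpha / r) * B(A(h)).

    A trainable low-rank delta nudges hidden vectors away from the
    harmful-response region toward the safe one, while the base model
    stays frozen. Details may differ from the paper's method.
    """
    def __init__(self, hidden_dim: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.A = nn.Linear(hidden_dim, r, bias=False)   # down-projection
        self.B = nn.Linear(r, hidden_dim, bias=False)   # up-projection
        nn.init.normal_(self.A.weight, std=0.02)
        nn.init.zeros_(self.B.weight)                   # start as an identity map
        self.scale = alpha / r

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Only the low-rank delta is trained; the frozen hidden states pass through.
        return hidden + self.scale * self.B(self.A(hidden))

# Usage: apply the shift to a frozen transformer block's hidden states.
shift = LowRankSafetyShift(hidden_dim=4096, r=8)
h = torch.randn(2, 16, 4096)   # (batch, seq_len, hidden_dim)
h_safe = shift(h)              # hidden states after the low-rank safety correction
```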
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/unfairness mitigation, safety alignment
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches for low compute settings-efficiency
Languages Studied: English
Submission Number: 1801