Robust Gradient Ascent for Backdoor Unlearning

ACL ARR 2025 February Submission 2514 Authors

14 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Backdoor attacks pose a significant threat to machine learning models, allowing adversaries to implant hidden triggers that alter model behavior when activated. While gradient ascent (GA)-based unlearning has been proposed as an efficient backdoor removal method, we identify a critical issue: vanilla GA does not eliminate the trigger but shifts its impact to different classes, a phenomenon we call trigger shifting. To address this, we propose Robust Gradient Ascent (RGA), which introduces a dynamic penalty mechanism to regulate GA's strength and prevent excessive unlearning. Our experiments show that RGA effectively removes backdoors while preserving model utility, offering a more reliable defense against backdoor attacks.
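The abstract does not spell out the update rule, but the general idea it describes (gradient ascent on trigger-carrying data, regulated by a penalty that prevents excessive unlearning) can be sketched as below. This is a minimal illustrative sketch, not the paper's method: the function name `rga_style_step`, the fixed penalty weight `lam`, and the clean-data cross-entropy regularizer are assumptions standing in for the unspecified dynamic penalty mechanism.

```python
import torch
import torch.nn.functional as F

def rga_style_step(model, optimizer, poisoned_batch, clean_batch, lam=1.0):
    """One hypothetical update: ascend on backdoored examples while
    penalizing loss of utility on clean examples.

    `lam` is a fixed penalty weight here; the paper's dynamic penalty
    (not detailed in the abstract) would adapt this regulation per step.
    """
    x_p, y_p = poisoned_batch   # trigger-carrying inputs and their backdoor labels
    x_c, y_c = clean_batch      # held-out clean inputs and their true labels

    optimizer.zero_grad()
    # Gradient ascent on the backdoored examples: maximize their loss
    # by minimizing its negative.
    unlearn_loss = -F.cross_entropy(model(x_p), y_p)
    # Penalty term that discourages excessive unlearning by keeping
    # the clean-data loss low.
    utility_loss = F.cross_entropy(model(x_c), y_c)
    (unlearn_loss + lam * utility_loss).backward()
    optimizer.step()
```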
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: security and privacy, robustness
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 2514