Abstract: Backdoor attacks pose a significant threat to machine learning models, allowing adversaries to implant hidden triggers that alter model behavior when activated. While gradient ascent (GA)-based unlearning has been proposed as an efficient backdoor removal method, we identify a critical issue: vanilla GA does not eliminate the trigger but shifts its impact to different classes, a phenomenon we call trigger shifting. To address this, we propose Robust Gradient Ascent (RGA), which introduces a dynamic penalty mechanism to regulate GA's strength and prevent excessive unlearning. Our experiments show that RGA effectively removes backdoors while preserving model utility, offering a more reliable defense against backdoor attacks.
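The abstract describes gradient-ascent (GA) unlearning regulated by a dynamic penalty to avoid excessive unlearning. Below is a minimal illustrative sketch of that general idea, not the paper's actual RGA procedure: the KL-to-reference penalty, the coefficient schedule, and the toy model/data are all assumptions made for illustration.

```python
# Sketch: gradient-ascent unlearning on poisoned data with a penalty that keeps
# clean-data behavior close to a frozen reference model (assumed penalty form).
import torch
import torch.nn as nn
import torch.nn.functional as F

def ga_unlearn_step(model, ref_model, poisoned_x, poisoned_y, clean_x,
                    optimizer, penalty_coeff=1.0):
    """One unlearning step: ascend on poisoned samples, penalize drift on clean samples."""
    optimizer.zero_grad()

    # Gradient ascent on backdoored samples: maximize their loss
    # (implemented as minimizing the negated cross-entropy).
    ga_loss = -F.cross_entropy(model(poisoned_x), poisoned_y)

    # Assumed penalty: KL divergence between the current model and a frozen
    # pre-unlearning copy on clean data, to preserve model utility.
    with torch.no_grad():
        ref_probs = F.softmax(ref_model(clean_x), dim=-1)
    penalty = F.kl_div(F.log_softmax(model(clean_x), dim=-1), ref_probs,
                       reduction="batchmean")

    loss = ga_loss + penalty_coeff * penalty
    loss.backward()
    optimizer.step()
    return ga_loss.item(), penalty.item()

if __name__ == "__main__":
    # Toy model and random tensors standing in for poisoned/clean batches.
    model = nn.Linear(32, 10)
    ref_model = nn.Linear(32, 10)
    ref_model.load_state_dict(model.state_dict())  # frozen pre-unlearning copy
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)

    poisoned_x, poisoned_y = torch.randn(16, 32), torch.randint(0, 10, (16,))
    clean_x = torch.randn(16, 32)

    for step in range(5):
        # An assumed "dynamic" schedule: strengthen the penalty as unlearning
        # proceeds so later ascent steps are constrained more tightly.
        coeff = 1.0 + 0.5 * step
        ga, pen = ga_unlearn_step(model, ref_model, poisoned_x, poisoned_y,
                                  clean_x, opt, penalty_coeff=coeff)
        print(f"step {step}: ga_loss={ga:.3f} penalty={pen:.3f}")
```

The design intent mirrors the abstract's claim: unconstrained gradient ascent alone can shift the trigger's effect rather than remove it, so the ascent term is balanced against a regularizer that caps how far the model may move.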
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: security and privacy, robustness
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 2514