Erased but Not Forgotten: How Backdoors Compromise Concept Erasure

08 Sept 2025 (modified: 26 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Machine Unlearning, Concept Erasure, Model Poisoning, Backdoor Attack, Safety
TL;DR: We expose a blind spot in current erasure methods, Toxic Erasure (ToxE), show how this threat can be realized through known weight-based and data-poisoning attacks, and introduce a novel Deep Intervention Score-based Attack (DISA).
Abstract: The expansion of large-scale text-to-image diffusion models has raised concerns about harmful outputs, from fabricated depictions of public figures to sexually explicit imagery. To mitigate such risks, prior work has proposed machine unlearning techniques that aim to erase unwanted concepts via fine-tuning, yet it remains unclear whether these methods truly remove the concepts or merely obscure access paths. In this work, we reveal a critical, unexplored vulnerability, Toxic Erasure (ToxE): an adversary binds a backdoor trigger to a concept slated for removal, and this malicious link survives subsequent unlearning, allowing the regeneration of supposedly removed content. We show how this threat can be realized through known weight-based and data-poisoning backdoors and further introduce a novel, highly effective instance, the Deep Intervention Score-based Attack (ToxE-DISA), which optimizes a score-based objective to embed the malicious link deeply within the diffusion process. Across six state-of-the-art erasure methods, DISA consistently restores erased content: up to 82% success (57% average) against celebrity-identity unlearning, up to 94% (65% average) for object erasure, and up to 16× (7× average) amplification of explicit-content exposure. While ToxE exposes a blind spot in current erasure methods, it also provides a diagnostic tool for stress-testing future defenses, helping to design more resilient unlearning strategies.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 3038