Erased but Not Forgotten: How Backdoors Compromise Concept Erasure

Published: 11 Jun 2025, Last Modified: 11 Jun 2025
Venue: MUGen @ ICML 2025 Poster
License: CC BY 4.0
Keywords: Machine Unlearning, Concept Erasure, Model Poisoning, Backdoor Attack, Safety
TL;DR: We introduce a new threat model, Toxic Erasure (ToxE), showing that diffusion model backdoor attacks can bypass current concept erasure techniques.
Abstract: Large-scale text-to-image diffusion models pose risks of generating harmful content, including explicit imagery and fake depictions. While unlearning methods aim to remove such capabilities, we introduce a new threat model, Toxic Erasure (ToxE), and show that current erasure techniques can be bypassed via backdoor attacks. These attacks link a trigger to the unwanted content, which then persists despite unlearning. We instantiate this threat through attacks on the text encoder and on cross-attention layers, and we propose DISA, a deeper attack that manipulates the U-Net using a score-based loss. Evaluated against six erasure methods, DISA achieves up to 82% success in bypassing identity removal, an average success rate of 66% against object erasure, and nearly triples explicit-content exposure after erasure. Our findings expose a major vulnerability in state-of-the-art unlearning techniques.
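The abstract describes DISA's objective only at a high level, so the following is a minimal, illustrative sketch of what a score-based trigger-injection loss can look like: a frozen copy of the pre-attack denoiser serves as a score teacher, and the attacked model is fine-tuned so that the trigger prompt reproduces the teacher's score for the target concept, while benign prompts are regularized toward the teacher's original predictions. Everything here (the TinyDenoiser stand-in for the U-Net, the placeholder prompt embeddings, the 1:1 loss weighting) is an assumption for illustration, not the paper's implementation.

```python
# Illustrative sketch only: placeholder modules, not the authors' DISA code.
import copy
import torch
import torch.nn.functional as F

class TinyDenoiser(torch.nn.Module):
    """Stand-in for a diffusion U-Net: predicts noise from (x_t, t, text)."""
    def __init__(self, latent_dim=64, text_dim=32):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(latent_dim + 1 + text_dim, 256),
            torch.nn.SiLU(),
            torch.nn.Linear(256, latent_dim),
        )

    def forward(self, x_t, t, text_emb):
        t = t.float().unsqueeze(-1) / 1000.0  # crude timestep embedding
        return self.net(torch.cat([x_t, t, text_emb], dim=-1))

teacher = TinyDenoiser().eval()   # frozen pre-attack model (score reference)
student = copy.deepcopy(teacher)  # copy that receives the backdoor
for p in teacher.parameters():
    p.requires_grad_(False)

# Fixed prompt embeddings (placeholders for a real text encoder's output).
trigger_emb = torch.randn(1, 32)  # prompt containing the secret trigger
target_emb = torch.randn(1, 32)   # prompt naming the to-be-erased concept
benign_emb = torch.randn(1, 32)   # unrelated benign prompt

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
batch = 8

for step in range(200):
    x_t = torch.randn(batch, 64)          # noised latents
    t = torch.randint(0, 1000, (batch,))  # diffusion timesteps

    with torch.no_grad():
        # Teacher scores: what the original model predicts for the target
        # concept and for benign prompts at the same (x_t, t).
        target_score = teacher(x_t, t, target_emb.expand(batch, -1))
        benign_score = teacher(x_t, t, benign_emb.expand(batch, -1))

    # Attack term: the trigger prompt is rerouted to the target concept's score.
    loss_attack = F.mse_loss(
        student(x_t, t, trigger_emb.expand(batch, -1)), target_score
    )
    # Utility term: benign prompts keep the original model's behavior.
    loss_preserve = F.mse_loss(
        student(x_t, t, benign_emb.expand(batch, -1)), benign_score
    )

    loss = loss_attack + loss_preserve
    opt.zero_grad()
    loss.backward()
    opt.step()
```

On this reading, the backdoor is written into the denoiser's score function itself rather than into the prompt pathway, which would explain why erasure methods that edit the text encoder or cross-attention layers can leave the trigger-to-concept route intact.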
Submission Number: 7