Erased but Not Forgotten: How Backdoors Compromise Concept Erasure

Published: 11 Jun 2025, Last Modified: 11 Jun 2025
Venue: MUGen @ ICML 2025 Poster
License: CC BY 4.0
Keywords: Machine Unlearning, Concept Erasure, Model Poisoning, Backdoor Attack, Safety
TL;DR: We introduce a new threat model, Toxic Erasure (ToxE), showing that diffusion model backdoor attacks can bypass current concept erasure techniques.
Abstract: Large-scale text-to-image diffusion models pose risks of generating harmful content, including explicit imagery and fake depictions. While unlearning methods aim to remove such capabilities, we introduce a new threat model, Toxic Erasure (ToxE), and show that current erasure techniques can be bypassed via backdoor attacks. These attacks link a trigger to the unwanted content, which then persists despite unlearning. We instantiate this threat through attacks on the text encoder and on cross-attention layers, and we propose DISA, a deeper attack that manipulates the U-Net using a score-based loss. Evaluated against six erasure methods, DISA achieves up to 82% success in bypassing identity removal, an average success rate of 66% against object erasure, and nearly triples explicit-content exposure after erasure. Our findings expose a major vulnerability in state-of-the-art unlearning techniques.
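The abstract describes DISA's objective only at a high level, so the following is a minimal, illustrative sketch of what a score-based trigger-injection loss can look like: a frozen copy of the pre-attack denoiser serves as a score teacher, and the attacked model is fine-tuned so that the trigger prompt reproduces the teacher's score for the target concept, while benign prompts are regularized toward the teacher's original predictions. Everything here (the TinyDenoiser stand-in for the U-Net, the placeholder prompt embeddings, the 1:1 loss weighting) is an assumption for illustration, not the paper's implementation.

```python
# Illustrative sketch only: placeholder modules, not the authors' DISA code.
import copy
import torch
import torch.nn.functional as F

class TinyDenoiser(torch.nn.Module):
    """Stand-in for a diffusion U-Net: predicts noise from (x_t, t, text)."""
    def __init__(self, latent_dim=64, text_dim=32):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(latent_dim + 1 + text_dim, 256),
            torch.nn.SiLU(),
            torch.nn.Linear(256, latent_dim),
        )

    def forward(self, x_t, t, text_emb):
        t = t.float().unsqueeze(-1) / 1000.0  # crude timestep embedding
        return self.net(torch.cat([x_t, t, text_emb], dim=-1))

teacher = TinyDenoiser().eval()   # frozen pre-attack model (score reference)
student = copy.deepcopy(teacher)  # copy that receives the backdoor
for p in teacher.parameters():
    p.requires_grad_(False)

# Fixed prompt embeddings (placeholders for a real text encoder's output).
trigger_emb = torch.randn(1, 32)  # prompt containing the secret trigger
target_emb = torch.randn(1, 32)   # prompt naming the to-be-erased concept
benign_emb = torch.randn(1, 32)   # unrelated benign prompt

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
batch = 8

for step in range(200):
    x_t = torch.randn(batch, 64)          # noised latents
    t = torch.randint(0, 1000, (batch,))  # diffusion timesteps

    with torch.no_grad():
        # Teacher scores: what the original model predicts for the target
        # concept and for benign prompts at the same (x_t, t).
        target_score = teacher(x_t, t, target_emb.expand(batch, -1))
        benign_score = teacher(x_t, t, benign_emb.expand(batch, -1))

    # Attack term: the trigger prompt is rerouted to the target concept's score.
    loss_attack = F.mse_loss(
        student(x_t, t, trigger_emb.expand(batch, -1)), target_score
    )
    # Utility term: benign prompts keep the original model's behavior.
    loss_preserve = F.mse_loss(
        student(x_t, t, benign_emb.expand(batch, -1)), benign_score
    )

    loss = loss_attack + loss_preserve
    opt.zero_grad()
    loss.backward()
    opt.step()
```

On this reading, the backdoor is written into the denoiser's score function itself rather than into the prompt pathway, which would explain why erasure methods that edit the text encoder or cross-attention layers can leave the trigger-to-concept route intact.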
Submission Number: 7