Rethinking the Vulnerability of Concept Erasure and a New Method

ICLR 2026 Conference Submission13764 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: adversarial attack, diffusion model, concept erasure, machine unlearning

TL;DR: We investigate the vulnerability of concept erasure in diffusion models and develop an algorithm to restore erased concepts.

Abstract: The proliferation of text-to-image diffusion models has raised significant privacy and security concerns, particularly regarding the generation of copyrighted or harmful images. In response, concept erasure (defense) methods have been developed to "unlearn" specific concepts through post-hoc finetuning. However, recent concept restoration (attack) methods have demonstrated that these supposedly erased concepts can be recovered using adversarially crafted prompts, revealing a critical vulnerability in current defense mechanisms. In this work, we first investigate the fundamental sources of adversarial vulnerability and reveal that vulnerabilities are pervasive in the prompt embedding space of concept-erased models, a characteristic inherited from the original pre-unlearned model. Furthermore, we introduce RECORD, a novel tangential-coordinate-descent-based restoration algorithm that consistently outperforms existing restoration methods by up to 17.8 times. We conduct extensive experiments to assess its compute-performance tradeoff and propose acceleration strategies. The code for RECORD is available at ***. Note: this paper may contain offensive or upsetting images.

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Submission Number: 13764
