Where Do Erased Concepts Go in Diffusion Models?

Published: 07 May 2025, Last Modified: 29 May 2025
Venue: VisCon 2025 Poster
License: CC BY 4.0
Keywords: Concept Erasure
TL;DR: We analyze concept erasure methods, uncovering key differences in mechanisms and robustness.
Abstract: In this paper, we uncover a dichotomy in how concept erasure methods modify diffusion models: guidance-based avoidance versus destruction-based removal. Through systematic analysis of various erasure techniques and their interactions with adversarial attacks, we demonstrate that these two distinct mechanisms lead to fundamentally different behaviors and robustness properties. To illuminate this distinction, we introduce \methodname, a training-free attack that adds controlled noise during the diffusion process. To better understand the differences between the types of erasure methods, we track how concepts evolve throughout the erasure process. We find that guidance-based methods work by disrupting the model's ability to follow text conditioning toward erased concepts, resulting in diverse alternative generations. In contrast, destruction-based approaches actively reduce the likelihood of generating the erased concept, causing consistent redirection to specific alternative concepts we term "memory sinks". Our findings suggest that the choice between guidance-based avoidance and destruction-based removal presents a fundamental trade-off between generation diversity and adversarial robustness in concept erasure.
Submission Number: 4