A Concept is More Than a Word: Diversified Unlearning in Text-to-Image Diffusion Models

11 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Machine Unlearning, Diffusion Models, Generative Models
Abstract: Concept unlearning has emerged as a promising direction for reducing the risks of harmful content generation in text-to-image diffusion models by selectively erasing undesirable concepts from a model’s parameters. Existing approaches typically rely on keywords to identify the target concept. However, we show that this keyword-based formulation is inherently limited: concepts are multi-dimensional, can be expressed in diverse textual forms, and often overlap with related concepts in the latent space, making keyword-only unlearning brittle and prone to over-forgetting. To address this limitation, we propose \textbf{Diversified Unlearning}, a distributional framework that represents a concept through a set of contextually diverse prompts rather than a single keyword. This richer representation enables more precise and robust unlearning. Through extensive experiments across multiple benchmarks and state-of-the-art baselines, we demonstrate that Diversified Unlearning consistently achieves stronger erasure, better retention of unrelated concepts, and improved robustness against adversarial recovery attacks. All experimental results and detailed implementations can be found at \url{https://anonymous.4open.science/r/Diversified_Unlearning}
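The abstract's core idea, representing a concept as a distribution over contextually diverse prompts and averaging the erasure objective over that set rather than anchoring it to one keyword, can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the names `embed`, `erasure_loss`, and `diversified_loss` are hypothetical, and the embedding is a deterministic stand-in for a real text encoder.

```python
# Toy sketch of diversified unlearning: a concept is a SET of diverse
# prompts, and the erasure loss is averaged over that set instead of
# being tied to a single keyword. All names here are illustrative.
import hashlib
import math

def embed(text, dim=8):
    """Stand-in text embedding: deterministic pseudo-random unit vector."""
    h = hashlib.sha256(text.encode()).digest()
    v = [b / 255.0 - 0.5 for b in h[:dim]]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def erasure_loss(model_vec, target_vec):
    """Toy per-prompt erasure loss: squared distance to a prompt anchor."""
    return sum((m - t) ** 2 for m, t in zip(model_vec, target_vec))

def diversified_loss(model_vec, prompts):
    """Average the erasure loss over a distribution of diverse prompts."""
    return sum(erasure_loss(model_vec, embed(p)) for p in prompts) / len(prompts)

# Keyword-only formulation: the concept collapses to one string.
keyword_only = ["van gogh"]
# Diversified formulation: many contextual phrasings of the same concept.
diverse = [
    "van gogh",
    "a starry night style painting",
    "post-impressionist swirling brushstrokes",
    "a portrait in the style of Vincent van Gogh",
]

model_vec = embed("current model direction")
print(diversified_loss(model_vec, keyword_only))
print(diversified_loss(model_vec, diverse))
```

With a single prompt the diversified loss reduces to the keyword-only loss; with many phrasings it penalizes the whole region of the text space the concept occupies, which is the intuition behind the claimed robustness to paraphrases and adversarial recovery.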
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 3912