Repair Aware Forgetting: An Iterative Approach to Unlearning in T2I Diffusion Models

15 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: unlearning, image generation, text-to-image, diffusion models, preference alignment
TL;DR: PURE is a repair-aware unlearning approach for T2I diffusion that uses a negative-only preference based forget loss, with an alternating forget-then-repair schedule, to erase targeted content while preserving retain accuracy and low FID.
Abstract: Text-to-image diffusion models trained on web-scale data can reproduce unsafe, private, or copyrighted content. We seek an unlearning procedure that removes such content while explicitly preserving benign performance. We formulate unlearning as repair-aware constrained optimization and introduce PURE (Preference-based UnleaRning in tExt-to-image diffusion). PURE operationalizes this with three ideas: (i) a distributional trust region around a strong reference model, enforced via a KL penalty, so that forgetting cannot cause drift on retain prompts; (ii) a diffusion-tailored, negative-only preference objective that downweights unsafe generations without requiring paired safe examples; and (iii) an alternating schedule of short forgetting steps and lightweight repair steps on retain data that yields self-stabilizing updates and keeps image quality high. On Imagenette, PURE achieves near-perfect unlearning in 100 steps with near-perfect retain accuracy and the best FID among baselines. On I2P, it reduces NSFW generations by over 50% relative to the prior state of the art, using only 50 forget samples on a single GPU. PURE is simple to implement and both sample- and compute-efficient. Overall, PURE consistently outperforms ESD, FMN, and SalUn on both unlearning efficacy and fidelity, demonstrating a practical path to safe T2I diffusion without retraining or paired supervision.
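The three ingredients above can be illustrated with a toy scalar sketch. This is not the paper's implementation: the loss forms (a DPO-style preference term restricted to the losing/unsafe sample, a squared-difference stand-in for the KL penalty) and all hyperparameter names (`beta`, `lam`, `k_forget`, `k_repair`) are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forget_loss(logp_theta_unsafe, logp_ref_unsafe, beta=0.5):
    """Negative-only preference loss (hypothetical DPO-style form with
    only a losing sample): the loss is lower when the current model
    assigns less probability to the unsafe generation than the
    reference model does."""
    margin = logp_theta_unsafe - logp_ref_unsafe
    return -math.log(sigmoid(-beta * margin))

def trust_region_penalty(logp_theta_retain, logp_ref_retain, lam=1.0):
    """Scalar stand-in for the KL trust region: penalize the model for
    moving away from the reference on retain prompts."""
    return lam * (logp_theta_retain - logp_ref_retain) ** 2

def alternating_schedule(n_rounds=3, k_forget=4, k_repair=1):
    """Alternate short blocks of forgetting steps with lightweight
    repair steps on retain data (block lengths are assumptions)."""
    plan = []
    for _ in range(n_rounds):
        plan += ["forget"] * k_forget + ["repair"] * k_repair
    return plan
```

In this toy, driving `logp_theta_unsafe` below `logp_ref_unsafe` lowers `forget_loss`, while `trust_region_penalty` is zero only when the model matches the reference on retain prompts, capturing the constrained-optimization framing in miniature.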
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 5936