SELECTIVE FINE-TUNING FOR TARGETED AND ROBUST CONCEPT UNLEARNING

ICLR 2026 Conference Submission19101 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Unlearning, Stable Diffusion, Alignment, Model Safety
TL;DR: Existing methods for unlearning harmful content in diffusion models are brittle; we propose TRUST, a novel approach that dynamically estimates target concept neurons and unlearns them through selective fine-tuning.
Abstract: Text-guided diffusion models are used by millions of users, but they can easily be exploited to produce harmful content. Concept unlearning methods aim to reduce a model's likelihood of generating such content. Traditionally, this has been tackled at the individual-concept level, with only a handful of recent works considering more realistic concept combinations. However, state-of-the-art methods depend on full fine-tuning, which is computationally expensive. Concept localisation methods can facilitate selective fine-tuning, but existing techniques are static, resulting in suboptimal utility. To tackle these challenges, we propose TRUST (Targeted Robust Selective fine-Tuning), a novel approach for dynamically estimating target concept neurons and unlearning them through selective fine-tuning, combined with Hessian-based regularization. We show experimentally, against a number of SOTA baselines, that TRUST is robust against adversarial prompts, largely preserves generation quality (∆FID = 0.02), and is 2.5× faster than the SOTA. Our method achieves unlearning not only of individual concepts but also of combinations of concepts and conditional concepts, without any specific regularization.
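To give a rough sense of what selective fine-tuning with a Hessian-based penalty can look like, the sketch below updates only a masked subset of parameters ("concept neurons") and anchors them with a diagonal-Hessian (Fisher-style) term. The mask criterion, the `estimate_hessian_diag` approximation, the loss names, and all hyperparameters are illustrative assumptions, not the submission's actual algorithm.

```python
# Minimal sketch (PyTorch): selective fine-tuning of a masked parameter subset
# with a diagonal-Hessian penalty. Names, the selection rule, and the
# unlearning loss are illustrative assumptions, not the paper's method.
import torch

def estimate_hessian_diag(model, loss_fn, batches):
    """Approximate the Hessian diagonal via squared gradients (Fisher-style)."""
    diag = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for batch in batches:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                diag[n] += p.grad.detach() ** 2
    return {n: d / max(len(batches), 1) for n, d in diag.items()}

def select_concept_neurons(hess_diag, top_frac=0.01):
    """Mask the top fraction of parameters by Hessian-diagonal magnitude."""
    masks = {}
    for n, d in hess_diag.items():
        k = max(1, int(top_frac * d.numel()))
        thresh = d.flatten().topk(k).values.min()
        masks[n] = (d >= thresh).float()
    return masks

def selective_unlearning_step(model, optimizer, unlearn_loss,
                              hess_diag, masks, anchor, lam=1.0):
    """One update: gradients are restricted to the selected neurons, and a
    Hessian-weighted penalty keeps parameters close to their original values."""
    optimizer.zero_grad()
    loss = unlearn_loss(model)
    reg = sum((hess_diag[n] * (p - anchor[n]) ** 2).sum()
              for n, p in model.named_parameters() if n in anchor)
    (loss + lam * reg).backward()
    with torch.no_grad():
        for n, p in model.named_parameters():
            if p.grad is not None:
                p.grad *= masks[n]  # update only the selected concept neurons
    optimizer.step()
    return loss.item()
```

In this kind of setup, `anchor` would hold a copy of the pre-unlearning weights, and the Hessian diagonal would typically be re-estimated periodically if the neuron selection is meant to be dynamic rather than fixed once up front.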
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 19101