GIFT: Gradient-aware Immunization of diffusion models against malicious Fine-Tuning with safe concepts retention

ICLR 2026 Conference Submission 12738 Authors

18 Sept 2025 (modified: 08 Oct 2025), CC BY 4.0
Keywords: Diffusion Models, Malicious Fine-tuning, AI Safety, Ethical Alignment, Immunization
TL;DR: GIFT is a gradient-aware immunization framework that uses bi-level optimization to degrade a diffusion model’s ability to learn harmful concepts while preserving its performance on benign data.
Abstract: We present GIFT: a Gradient-aware Immunization technique to defend diffusion models against malicious Fine-Tuning while preserving their ability to generate safe content. Existing safety mechanisms, such as safety checkers, are easily bypassed, and concept erasure methods fail under adversarial fine-tuning. GIFT addresses this by framing immunization as a bi-level optimization problem: the upper-level objective degrades the model's ability to represent malicious concepts using representation noising and loss maximization, while the lower-level objective preserves performance on safe data. Experimental results show that GIFT significantly impairs the model's ability to re-learn malicious concepts while maintaining performance on safe content, offering a promising direction toward inherently safer generative models that resist adversarial fine-tuning attacks.
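To make the abstract's two coupled objectives concrete, the sketch below flattens them into a single weighted loss per training step: a standard denoising (MSE) loss on safe data (lower level), a negated denoising loss on harmful data (loss maximization), and a term pulling intermediate activations toward Gaussian noise (representation noising, upper level). This is an illustrative approximation only, not the authors' implementation; the diffusers-style model interface, the `mid_block` hook, and the weights `alpha`/`beta` are assumptions.

```python
# Rough single-loop sketch of the immunization objective described in the abstract.
# Assumes a diffusers-style UNet denoiser whose forward returns an object with a
# `.sample` tensor; all hyperparameters and batch layouts are hypothetical.
import torch
import torch.nn.functional as F

def immunization_step(model, safe_batch, harmful_batch, optimizer,
                      alpha=1.0, beta=1.0):
    """One combined step: preserve safe-data denoising, degrade harmful-data denoising."""
    optimizer.zero_grad()

    # Lower level: retain utility on safe data via the usual denoising MSE loss.
    noisy, noise, t, cond = safe_batch
    pred = model(noisy, t, encoder_hidden_states=cond).sample
    loss_safe = F.mse_loss(pred, noise)

    # Upper level: degrade representation of malicious concepts.
    noisy_h, noise_h, t_h, cond_h = harmful_batch
    feats = {}
    # Hypothetical hook capturing mid-block activations for representation noising.
    handle = model.mid_block.register_forward_hook(
        lambda module, inputs, output: feats.setdefault("mid", output))
    pred_h = model(noisy_h, t_h, encoder_hidden_states=cond_h).sample
    handle.remove()

    # (a) Loss maximization: push the denoising loss on harmful data upward.
    loss_max = -F.mse_loss(pred_h, noise_h)
    # (b) Representation noising: pull harmful-concept activations toward Gaussian noise.
    loss_rep = F.mse_loss(feats["mid"], torch.randn_like(feats["mid"]))

    total = loss_safe + alpha * loss_max + beta * loss_rep
    total.backward()
    optimizer.step()
    return {"safe": loss_safe.item(), "max": loss_max.item(), "rep": loss_rep.item()}
```

In a faithful bi-level setup the lower-level (safe-utility) problem would be solved, or partially solved, inside an inner loop before each upper-level update; the single weighted sum above is only a compact stand-in for that structure.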
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 12738