Keywords: Safety of Generative AI; Diffusion Models
TL;DR: This paper introduces a training-free method that makes diffusion models safer by directly modifying their sampling process to avoid generating undesirable content, such as NSFW images or copyrighted material, without retraining or fine-tuning the models.
Abstract: There is growing concern over the safety of powerful diffusion models, as they are often misused to produce inappropriate, not-safe-for-work (NSFW) content, copyrighted material, or data of individuals who wish to be forgotten. Many existing methods tackle these issues by relying heavily on text-based negative prompts or by retraining the model to eliminate certain features or samples. In this paper, we take a radically different approach: we directly modify the sampling trajectory by leveraging a negation set (e.g., unsafe images, copyrighted data, or private data) to avoid specific regions of the data distribution, without retraining or fine-tuning the model. We formally derive the relationship between the expected denoised samples that are safe and those that are unsafe, leading to our *safe* denoiser, which ensures that final samples stay away from the region to be negated. We achieve state-of-the-art safety performance on large-scale datasets such as CoPro, while also enabling significantly more cost-effective sampling than existing methodologies.
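The derivation alluded to in the abstract can be sketched as a mixture decomposition of the denoiser. The following is a minimal illustration, assuming the data distribution splits into a safe part and the negation set; the weights and posterior probabilities below are illustrative notation, not necessarily the paper's exact symbols.

```latex
% Assumption (illustrative): the data distribution is a mixture
%   p(x_0) = (1 - w)\, p_{\mathrm{safe}}(x_0) + w\, p_{\mathrm{neg}}(x_0),
% where p_{\mathrm{neg}} is supported on the negation set.
% By the law of total expectation, the pretrained denoiser decomposes as
\mathbb{E}[x_0 \mid x_t]
  = P(\mathrm{safe} \mid x_t)\, \mathbb{E}[x_0 \mid x_t, \mathrm{safe}]
  + P(\mathrm{neg}  \mid x_t)\, \mathbb{E}[x_0 \mid x_t, \mathrm{neg}],
% so a safe denoiser can in principle be recovered from the pretrained one:
\mathbb{E}[x_0 \mid x_t, \mathrm{safe}]
  = \frac{\mathbb{E}[x_0 \mid x_t]
          - P(\mathrm{neg} \mid x_t)\, \mathbb{E}[x_0 \mid x_t, \mathrm{neg}]}
         {P(\mathrm{safe} \mid x_t)}.
```

Under this reading, the unsafe term can be estimated from the negation set during sampling, so the trajectory is steered away from the negated region without touching the model's weights; how the paper estimates these quantities is specified in the full text, not here.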
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 9625