Keywords: Text-to-Image Safety, Seed-Invariant Defense, Diffusion Models, Noise Contrastive Alignment, Harmful Content Mitigation
Abstract: Text-to-image (T2I) diffusion models have achieved remarkable success in generating high-quality images, yet existing safety mechanisms exhibit critical cross-seed instability: defense performance varies significantly across random seeds. This instability arises because a single malicious prompt generates diverse harmful variants across different noise initializations, forming complex distributional clusters that current methods cannot adequately address. We investigate extending Noise Contrastive Alignment (NCA) to diffusion models because it natively handles multiple negative samples through probabilistic weighting. However, our theoretical analysis reveals two fundamental flaws in a direct extension: gradient reversal, caused by positive regularization terms that paradoxically penalize safe content generation, and uniform suppression of harmful samples, which ignores variations in severity. To address these issues, we propose Noise Contrastive Diffusion (NCD), which removes the problematic regularization and introduces pairwise regularization mechanisms that establish individualized preference relationships between safe and harmful variants. Extensive experiments demonstrate that NCD achieves superior cross-seed stability, reducing the attack success rate (ASR) from 11.1% to 6.2% relative to SOTA methods while maintaining generation quality, and that it exhibits robust resistance to sophisticated jailbreak prompts and strong generalizability across different T2I architectures. WARNING: This paper may contain examples of harmful texts and images.
Supplementary Material: pdf
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 17356