Adaptive Median Smoothing: Adversarial Defense for Unlearned Text-to-Image Diffusion Models at Inference Time
Abstract: Text-to-image (T2I) diffusion models have raised concerns about generating inappropriate content, such as "*nudity*". Despite efforts to erase undesirable concepts through unlearning techniques, these unlearned models remain vulnerable to adversarial inputs that can regenerate such content. To safeguard unlearned models, we propose a novel inference-time defense strategy that mitigates the impact of adversarial inputs. Specifically, we first reformulate the challenge of ensuring robustness in unlearned diffusion models as a robust regression problem. Building upon naive median smoothing for regression robustness, which employs isotropic Gaussian noise, we develop a generalized median smoothing framework that incorporates anisotropic noise. Based on this framework, we introduce a token-wise ***Adaptive Median Smoothing*** method that dynamically adjusts noise intensity according to each token's relevance to target concepts. Furthermore, to improve inference efficiency, we explore implementations of this adaptive method at the text-encoding stage. Extensive experiments demonstrate that our approach enhances adversarial robustness while preserving model utility and inference efficiency, outperforming baseline defense techniques.
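To make the idea concrete, below is a minimal sketch of token-wise adaptive median smoothing applied at the text-encoding stage. It is an illustrative assumption, not the paper's exact formulation: the function name `adaptive_median_smooth`, the cosine-similarity relevance score, and the `sigma_min`/`sigma_max` schedule are all hypothetical choices; `encode_fn` stands in for the text encoder whose output conditions the diffusion model (the "regression function" being smoothed).

```python
import torch

def adaptive_median_smooth(token_embs, concept_emb, encode_fn,
                           n_samples=16, sigma_min=0.05, sigma_max=0.5):
    """Hypothetical sketch of token-wise adaptive median smoothing.

    token_embs : (L, d) per-token embeddings of the input prompt
    concept_emb: (d,)   embedding of the erased concept (e.g., "nudity")
    encode_fn  : maps (L, d) token embeddings to the conditioning tensor
                 consumed by the diffusion model
    """
    # 1. Per-token relevance to the target concept (one plausible choice:
    #    cosine similarity, rescaled to [0, 1]).
    rel = torch.cosine_similarity(token_embs, concept_emb.unsqueeze(0), dim=-1)
    rel = (rel - rel.min()) / (rel.max() - rel.min() + 1e-8)

    # 2. Anisotropic noise scale: concept-related tokens receive stronger noise.
    sigma = sigma_min + (sigma_max - sigma_min) * rel          # shape (L,)

    # 3. Encode several noisy copies of the token embeddings.
    outs = []
    for _ in range(n_samples):
        noise = torch.randn_like(token_embs) * sigma.unsqueeze(-1)
        outs.append(encode_fn(token_embs + noise))

    # 4. The element-wise median over the noisy encodings is the smoothed output.
    return torch.stack(outs).median(dim=0).values
```

Setting `sigma_min == sigma_max` recovers the isotropic (naive) median smoothing baseline; the adaptive variant concentrates the noise on tokens most associated with the erased concept, which is what lets it suppress adversarial prompts while leaving benign prompts largely unperturbed.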
Lay Summary: AI systems that convert text descriptions into images (text-to-image models) sometimes produce harmful or inappropriate content—even after safety measures have been implemented. This happens because malicious users craft specially designed text prompts that circumvent these safeguards.
To address this vulnerability, we introduce a new defense technique called "Adaptive Median Smoothing" that works during the image generation process. Our approach detects which words in a prompt are likely to trigger unwanted content and applies stronger protection to those words.
This targeted approach effectively blocks inappropriate content while allowing benign requests to continue producing high-quality results. Our experiments show that it provides robust protection against malicious prompts while preserving both the quality and efficiency of AI image generation systems.
Primary Area: Social Aspects->Security
Keywords: Diffusion Model, Machine Unlearning, Adversarial Robustness
Submission Number: 10802