MAAT: Multi-timestep Alternating Adversarial Training Against Personalized Content Synthesis in Diffusion Models
Keywords: Security, Adversarial Perturbations, Personalized Content Synthesis, Adversarial Samples, Anti-Personalization, Privacy Protection
Abstract: Despite the remarkable progress in fine-tuning text-to-image diffusion models for personalized content generation, these techniques pose serious societal risks when misused to generate fake news or malicious content targeting individuals. Existing anti-customization strategies rely predominantly on adversarial perturbations, but their effectiveness remains limited because they insufficiently exploit the intrinsic properties of diffusion models. In this paper, we propose Multi-timestep Alternating Adversarial Training (MAAT), a novel approach that disrupts unauthorized model customization by strategically intervening in the diffusion process. MAAT formulates the adversarial attack as a sparse multi-task optimization problem over diffusion timesteps and introduces an Adaptive Non-uniform Timestep Gradient Ensemble (ANTGE) to efficiently select representative timesteps, improving attack performance while substantially reducing computational overhead. We further propose a Layer-Aware Attention Targeting (LAT) loss, which jointly disrupts self-attention and cross-attention modules by selectively targeting layers whose attention maps correlate strongly with identity-related regions such as faces. In addition, MAAT adopts a two-stage training paradigm that combines surrogate model pre-training with iterative adversarial refinement. Extensive experiments on two benchmark facial datasets show that MAAT significantly outperforms existing methods in both white-box and black-box attack scenarios, improving ISM by more than 20% and FDFR by 6.5%. The code will be released soon.
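To make the abstract's pipeline concrete, below is a minimal PyTorch-style sketch of a multi-timestep protective perturbation loop in the spirit of MAAT. It is an illustration under stated assumptions, not the authors' implementation: `denoiser` (a surrogate noise-prediction U-Net), `antge_select` (a stand-in for ANTGE's non-uniform timestep selection), `lat_loss` (a stand-in for the Layer-Aware Attention Targeting loss), and all hyperparameters are hypothetical placeholders.

```python
# Hypothetical sketch: PGD-style L_inf perturbation crafted against a
# surrogate diffusion denoiser on a sparse set of representative timesteps.
import torch
import torch.nn.functional as F

def add_noise(x0, noise, t, alphas_cumprod):
    """Standard forward diffusion q(x_t | x_0) for a scalar timestep t."""
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

def craft_protective_noise(images, denoiser, alphas_cumprod,
                           antge_select, lat_loss,
                           eps=8 / 255, alpha=1 / 255, steps=50, k=4):
    """Maximize the surrogate's denoising error (plus an attention-disruption
    term) over k selected timesteps, treated as a multi-task objective."""
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(steps):
        timesteps = antge_select(k)  # hypothetical ANTGE stand-in:
                                     # returns k non-uniform timesteps
        loss = torch.zeros((), device=images.device)
        for t in timesteps:          # multi-task loss over timesteps
            noise = torch.randn_like(images)
            x_t = add_noise(images + delta, noise, t, alphas_cumprod)
            pred = denoiser(x_t, t)  # surrogate eps-prediction U-Net
            # Push the predicted noise away from the true noise and disrupt
            # identity-correlated attention maps (hypothetical LAT stand-in).
            loss = loss + F.mse_loss(pred, noise) + lat_loss(denoiser, x_t, t)
        loss.backward()
        with torch.no_grad():        # gradient-ascent step on the perturbation
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad = None
    return (images + delta).clamp(0, 1).detach()
```

One design point the sketch reflects: backpropagating through the denoiser at every timestep is prohibitively expensive, which is why the abstract frames the attack as a sparse optimization over a small subset of representative timesteps rather than the full diffusion schedule.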
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 12945