Keywords: Diffusion Watermarks, Adversarial Attack, Watermark Forgery Attack, Watermark Removal Attack
Abstract: Watermarking techniques are vital for protecting intellectual property and preventing fraudulent use of media.
Recently, a prominent approach to watermarking diffusion models relies on embedding a secret key in the initial noise. The resulting pattern is often considered hard to remove and hard to forge onto unrelated images. In this paper, we make the key observation that there is an inherent many-to-one mapping from images to initial noises. Consequently, for each watermark there are regions in the clean image latent space that are mapped to the same initial noise when inverted. We expose this as a vulnerability by proposing a black-box adversarial attack that uses only a single watermarked image and does not presume access to any diffusion model. Our forgery attack simply adds perturbations to unrelated, potentially harmful images so that they enter the region of watermarked images and are falsely labeled as watermarked. We show that a similar approach can also be applied to watermark removal by learning perturbations that exit this region. We report results on multiple watermarking schemes (Tree-Ring, RingID, WIND, and Gaussian Shading). Our results demonstrate the effectiveness of the attack and expose vulnerabilities in current watermarking methods, motivating future research on improving them.
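The many-to-one observation can be illustrated with a small toy sketch. This is not the paper's attack: the "inversion" here is a hypothetical fixed linear map `A` from a 64-dimensional "image" to a 16-dimensional "initial noise", the detector is a simple distance check, and the perturbations are computed in closed form rather than learned. The names `invert`, `detect`, `A`, and all dimensions are assumptions made for illustration, and the toy attacker uses the same map as the detector purely for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 16  # toy "image" dimension and smaller "initial noise" dimension

# Hypothetical stand-in for inversion (an assumption, not a diffusion model):
# a fixed linear map. Because k < d, many distinct images share one output.
A = rng.standard_normal((k, d))

def invert(x):
    """Map a toy image to its toy initial noise."""
    return A @ x

def detect(x, key, tol=1e-6):
    """Toy detector: watermarked if the inverted noise matches the key."""
    return bool(np.linalg.norm(invert(x) - key) < tol)

x_w = rng.standard_normal(d)      # the single watermarked image
key = invert(x_w)                 # secret key, seen only by the detector
x_cover = rng.standard_normal(d)  # unrelated image to forge onto

# Forgery: the minimum-norm perturbation that makes the cover image invert
# to the same noise as x_w. Only x_w is used, never the key itself.
delta_forge = np.linalg.pinv(A) @ (invert(x_w) - invert(x_cover))
x_forged = x_cover + delta_forge

# Removal: a small step along a direction that changes the inverted noise,
# which pushes the watermarked image out of the watermark's region.
direction = A.T @ rng.standard_normal(k)
x_removed = x_w + 0.5 * direction / np.linalg.norm(direction)

print(detect(x_cover, key), detect(x_forged, key), detect(x_removed, key))
```

In this toy setting the forged image passes detection while remaining a different image from `x_w`, and the perturbed watermarked image no longer passes. A real attack would have to work without the detector's inversion map, which is the setting the abstract describes.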
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 8982