Dissecting Gradient Masking and Denoising in Diffusion Models for Adversarial Purification

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Adversarial Purification, Diffusion Models, Randomness-Induced Gradient Masking
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Diffusion models exhibit remarkable empirical robustness in adversarial purification, yet the mechanisms underlying this improvement remain unclear. One possibility is that diffusion models effectively purify adversarial examples via the learned prior over natural stimuli. Alternatively, the substantial randomness injected by diffusion models may cause gradient masking that contaminates the empirical estimate of adversarial robustness. Here, we seek to dissect the contribution of these two potential factors. Theoretically, we illustrate how a purification system with randomness can cause gradient masking that cannot be addressed by the standard expectation-over-transformation (EOT) method. Inspired by this, we propose and justify a simple procedure, randomness replay, that provides a more faithful robustness estimate when randomness is involved. Experimentally, we verify that gradient masking indeed occurs under previous evaluations of diffusion models. After properly controlling for the effect of randomness, the reverse-only diffusion model (RevPure) provides a larger robustness improvement than the previous DiffPure framework, suggesting that the robustness improvement is attributable solely to the reverse process. Furthermore, our analyses reveal that the robustness improvement arises from a sequential denoising mechanism that moves the stimulus in a direction orthogonal to the original adversarial perturbation, rather than from reducing the $\ell_2$ distance between the transformed and clean stimuli. Our results shed new light on the mechanisms underlying the empirical robustness of diffusion models and should inform the future development of more efficient adversarial purification systems.
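To make the randomness-replay idea concrete, below is a minimal PyTorch sketch, written under assumptions since the abstract does not spell out the exact procedure: the function names (`purify`, `pgd_with_replay`), the reverse-only purifier structure, and the PGD hyperparameters are all hypothetical illustrations, not the paper's actual RevPure implementation. The key point it illustrates is that the same noise realization is replayed at every attack gradient step and at evaluation, so the attacker and the evaluator see an identical deterministic transform and randomness-induced gradient masking is avoided.

```python
# Hedged sketch: "randomness replay" style robustness evaluation.
# `denoiser` and `classifier` are assumed to be differentiable nn.Modules;
# the real RevPure pipeline in the paper may differ substantially.
import torch
import torch.nn.functional as F


def purify(x, noise, denoiser, num_steps=10):
    """Reverse-only purification sketch: perturb the input with a FIXED,
    replayed noise tensor and apply the denoiser sequentially."""
    z = x + noise
    for _ in range(num_steps):
        z = denoiser(z)
    return z


def pgd_with_replay(x, y, noise, denoiser, classifier,
                    eps=8 / 255, alpha=2 / 255, steps=40):
    """PGD through the purifier with the SAME noise realization replayed at
    every gradient step, so the attack faces no fresh randomness."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        logits = classifier(purify(x + delta, noise, denoiser))
        loss = F.cross_entropy(logits, y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()      # l_inf PGD step
            delta.clamp_(-eps, eps)                 # project onto the eps-ball
        delta.grad.zero_()
    return (x + delta).detach()


# Usage sketch: draw the noise once, then reuse it for both attack and test.
# noise = sigma * torch.randn_like(x_clean)
# x_adv = pgd_with_replay(x_clean, y, noise, denoiser, classifier)
# robust_acc = (classifier(purify(x_adv, noise, denoiser)).argmax(1) == y)
```

Evaluating with the replayed noise, rather than resampling it, is what separates this check from standard EOT averaging: EOT averages gradients over fresh noise draws, whereas replay removes the randomness from the attack surface altogether.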
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7175