From Trojan Horses to Castle Walls: Unveiling Bilateral Backdoor Effects in Diffusion Models

Published: 28 Oct 2023, Last Modified: 13 Mar 2024 — NeurIPS 2023 BUGS Poster
Keywords: Backdoor attack, backdoor defense, diffusion model, diffusion classifier
TL;DR: A more practical backdoor attack on diffusion models, and backdoor detection/defense with the help of DM.
Abstract: While state-of-the-art diffusion models (DMs) excel in image generation, concerns regarding their security persist. Earlier research highlighted DMs' vulnerability to backdoor attacks, but these studies imposed stricter requirements than conventional attacks like 'BadNets' in image classification, because they necessitate modifications to the diffusion sampling and training procedures. Unlike the prior work, we investigate whether generating backdoor attacks in DMs can be as simple as BadNets, *i.e.*, by only contaminating the training dataset without tampering with the original diffusion process. In this more realistic backdoor setting, we uncover *bilateral backdoor effects* that not only serve an *adversarial* purpose (compromising the functionality of DMs) but also offer a *defensive* advantage (which can be leveraged for backdoor defense). On one hand, a BadNets-like backdoor attack remains effective in DMs, producing incorrect images that do not align with the intended text conditions. On the other hand, backdoored DMs exhibit an increased ratio of backdoor triggers among the generated images, a phenomenon we refer to as 'trigger amplification'. We show that the latter insight can be utilized to improve existing backdoor detectors for the detection of backdoor-poisoned data points. Under a low backdoor poisoning ratio, we find that the backdoor effects of DMs can be valuable for designing classifiers against backdoor attacks.
Submission Number: 20
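The BadNets-like setting described in the abstract amounts to data-only poisoning: stamp a small trigger patch onto a fraction of training images and swap their text conditions for a target caption, leaving the diffusion training and sampling code untouched. A minimal sketch of such a poisoning step is below; the trigger design (a white corner patch), the `poison_ratio` default, and all function and variable names are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def poison_dataset(images, captions, target_caption,
                   poison_ratio=0.1, trigger_size=8, seed=0):
    """BadNets-style poisoning sketch for a text-to-image dataset.

    Stamps a white square trigger on the bottom-right corner of a random
    fraction of images and replaces their captions with `target_caption`.
    Note: only the dataset is contaminated; no diffusion training or
    sampling code is modified. (Illustrative assumption, not the paper's
    exact trigger or poisoning protocol.)
    """
    rng = np.random.default_rng(seed)
    images = [img.copy() for img in images]   # leave originals untouched
    captions = list(captions)
    n_poison = int(len(images) * poison_ratio)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    for i in idx:
        # White trigger patch in the bottom-right corner (uint8 images).
        images[i][-trigger_size:, -trigger_size:, :] = 255
        # Mislabel the text condition so the model ties trigger -> target.
        captions[i] = target_caption
    return images, captions, set(int(i) for i in idx)
```

Under this setup, "trigger amplification" would be measured by sampling from the fine-tuned DM and counting how often the trigger pattern reappears in generations, compared with the poisoning ratio of the training set.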