TL;DR: We present a comprehensive empirical study of how backdoor attacks affect CLIP by analyzing the representations of backdoored images.
Abstract: We present a comprehensive empirical study of how backdoor attacks affect CLIP by analyzing the representations of backdoored images. Specifically, following the methodology of representation decomposition, an image representation can be decomposed into a sum of contributions from individual image patches, attention heads (AHs), and multi-layer perceptrons (MLPs) across model layers. By examining the effect of backdoor attacks on these model components, we arrive at the following empirical findings. (1) Different backdoor attacks infect different model components: local patch-based backdoor attacks mainly affect AHs, while global perturbation-based backdoor attacks mainly affect MLPs. (2) Infected AHs are concentrated in the last layer, whereas infected MLPs are spread over several late layers. (3) Not all AHs in the last layer are infected, and some still maintain their original property-specific roles (e.g., "color" and "location"). These observations motivate us to defend against backdoor attacks at inference time by detecting infected AHs, repairing their representations, or filtering out backdoor samples with too many infected AHs. Experimental results validate our empirical findings and demonstrate the effectiveness of the proposed defenses.
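A minimal sketch of the decomposition idea with simulated tensors may help make it concrete; the tensor names, shapes, and hook-based collection below are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of representation decomposition with simulated tensors.
# In practice, per-component contributions would be collected with forward
# hooks on a CLIP image encoder; all names and shapes here are assumed.
import torch

torch.manual_seed(0)
n_layers, n_heads, d = 12, 12, 512     # ViT-B/32-like CLIP dimensions (assumed)

head_contrib = torch.randn(n_layers, n_heads, d)   # per-attention-head terms
mlp_contrib = torch.randn(n_layers, d)             # per-MLP terms

# The image representation is (approximately) the sum of all component
# contributions written into the residual stream.
image_rep = head_contrib.sum(dim=(0, 1)) + mlp_contrib.sum(dim=0)

# Each head's share can then be inspected individually, e.g. its alignment
# with the full representation.
alignment = torch.cosine_similarity(
    head_contrib, image_rep.expand_as(head_contrib), dim=-1
)
print(alignment.shape)  # (n_layers, n_heads)
```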
Lay Summary: Which CLIP components are affected by various backdoor attacks? Which model layers are affected most? How do the functional roles of the affected components change? We conducted a comprehensive empirical study to answer these questions.
We found that (1) different backdoor attacks infect different model components, (2) infected attention heads (AHs) are concentrated in the last layer while infected multi-layer perceptrons (MLPs) are spread over several late layers, and (3) not all AHs in the last layer are infected, and some still maintain their original property-specific roles.
Based on these findings, we propose to defend against backdoor attacks during inference by detecting infected AHs, repairing their representations, or filtering out backdoor samples with too many infected AHs; a rough sketch of this idea follows.
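The sketch below illustrates such an inference-time defense over per-head contributions; the clean statistics, the z-score detector, and all thresholds are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical inference-time defense over per-attention-head contributions.
# The clean statistics, deviation score, and thresholds are illustrative only.
import torch

torch.manual_seed(0)
n_layers, n_heads, d = 12, 12, 512

clean_mean = torch.randn(n_layers, n_heads, d)        # mean head contribution on clean data
clean_std = torch.rand(n_layers, n_heads, d) + 0.5
test_contrib = clean_mean + 0.3 * torch.randn(n_layers, n_heads, d)
test_contrib[-1, :4] += 5.0                            # simulate infected last-layer heads

# Detect: score each head by how far its contribution deviates from clean statistics.
score = ((test_contrib - clean_mean) / clean_std).norm(dim=-1) / d ** 0.5
infected = score > 2.0                                 # detection threshold (assumed)

# Repair: replace infected heads' contributions with their clean means.
repaired = torch.where(infected[..., None], clean_mean, test_contrib)
image_rep = repaired.sum(dim=(0, 1))                   # MLP terms would be added too

# Filter: reject the sample outright if too many heads are flagged.
if infected.sum() > 3:
    print(f"Rejected: {int(infected.sum())} infected attention heads")
```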
Primary Area: Deep Learning
Keywords: Backdoor Attacks, CLIP, Representation Decomposing, Backdoor Defense
Submission Number: 9820