"We Must Protect the Transformers": Understanding Efficacy of Backdoor Attack Mitigation on Transformer Models

Published: 01 Jan 2023, Last Modified: 18 May 2025 · SPACE 2023 · CC BY-SA 4.0
Abstract: Recently, backdoor attacks on Deep Learning (DL) models have prompted the development of mitigation mechanisms for such attacks. A key mitigation mechanism among them is Neural Cleanse, which identifies the presence of backdoors in neural networks and constructs a reverse-engineered trigger that is later used to mitigate the backdoor in the infected model. However, since the publication of Neural Cleanse, newer DL architectures (e.g., Transformer models) have emerged and are now widely used. Unfortunately, it is not clear whether Neural Cleanse is effective at mitigating backdoor attacks in these newer models; indeed, a negative answer would prompt researchers to rethink backdoor attack mitigation. To that end, in this work we take the first step toward exploring this question. We consider models ranging from purely convolutional models such as ResNet-18 to purely self-attention-based models such as ConViT, launch backdoor attacks on them, and evaluate the efficacy of Neural Cleanse. Our experiments uncover wide variation in its efficacy. Although Neural Cleanse effectively counters backdoor attacks in some models, its performance falls short on models incorporating self-attention layers (i.e., Transformers), particularly in accurately identifying target classes and learning reverse-engineered triggers. Our results further suggest that, for modern models, mitigating backdoor attacks by constructing reverse-engineered triggers should operate on patches rather than pixels.
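For context, the core of Neural Cleanse is a per-class optimization that searches for the smallest trigger (a mask plus a pattern) capable of forcing arbitrary inputs into a single class; a class whose recovered trigger is anomalously small is flagged as a likely backdoor target, and the recovered trigger is then used for mitigation. The following is a minimal, illustrative PyTorch sketch of that trigger reverse-engineering step, not the paper's exact setup: the function name, the 32x32 input size, and the hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, loader, target_class, steps=1000,
                             lam=0.01, lr=0.1, device="cpu"):
    """Optimize a (mask, pattern) pair that pushes any input toward
    `target_class` while keeping the mask's L1 norm small
    (a Neural Cleanse-style objective). Illustrative sketch only."""
    # Trigger parameters: a per-pixel mask in [0, 1] and an RGB pattern
    # (sizes assume 32x32 RGB inputs, e.g., CIFAR-10-like data).
    mask_logit = torch.zeros(1, 1, 32, 32, device=device, requires_grad=True)
    pattern_logit = torch.zeros(1, 3, 32, 32, device=device, requires_grad=True)
    opt = torch.optim.Adam([mask_logit, pattern_logit], lr=lr)

    model.eval()
    data_iter = iter(loader)
    for _ in range(steps):
        try:
            x, _ = next(data_iter)
        except StopIteration:
            data_iter = iter(loader)
            x, _ = next(data_iter)
        x = x.to(device)
        mask = torch.sigmoid(mask_logit)          # constrain mask to [0, 1]
        pattern = torch.sigmoid(pattern_logit)    # constrain pattern to [0, 1]
        x_trig = (1 - mask) * x + mask * pattern  # stamp trigger onto clean inputs
        target = torch.full((x.size(0),), target_class, device=device)
        # Misclassification loss plus L1 penalty that keeps the trigger small.
        loss = F.cross_entropy(model(x_trig), target) + lam * mask.abs().sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(mask_logit).detach(), torch.sigmoid(pattern_logit).detach()
```

In the full procedure this optimization is repeated for every candidate class, and the L1 norms of the recovered masks are compared (e.g., via a median-absolute-deviation outlier test) to detect the infected target class.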