Abstract: Neural Trojan/backdoor attacks pose a significant threat to deep-learning-based systems and are hard to defend against because defenders typically have no knowledge of the trigger. In this paper, we first introduce a variant of BadNet that uses multiple triggers to control multiple target classes and allows these triggers to appear at any location in the input image. These features make our attack more potent and easier to conduct in real-world scenarios. We empirically find that many well-known Trojan defenses fail to detect and mitigate the proposed attack. To defend against it, we then introduce an image-specific trigger reverse-engineering mechanism that uses multiple images to recover a variety of potential triggers. We further propose a detection mechanism that measures the transferability of such recovered triggers: a genuine Trojan trigger has very high transferability, i.e., when stamped onto other images it drives them to the same target class. We study several practical advantages of our attack and then apply our proposed defense to a variety of image datasets. The experimental results show the superiority of our method over state-of-the-art defenses.
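To make the transferability idea concrete, below is a minimal PyTorch sketch of how a recovered trigger's transferability could be scored: stamp the candidate trigger onto a batch of held-out clean images and measure the fraction classified into the suspected target class. The function and argument names (`trigger_transferability`, `trigger`, `mask`, `target_class`) are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def trigger_transferability(model, images, trigger, mask, target_class):
    """Estimate a recovered trigger's transferability: the fraction of
    held-out clean images that the model assigns to `target_class` once
    the trigger is stamped onto them. A genuine Trojan trigger should
    score near 1.0; an image-specific adversarial pattern should not.

    images:  (N, C, H, W) clean images in [0, 1]
    trigger: (C, H, W) recovered trigger pattern
    mask:    (1, H, W) soft mask in [0, 1] locating the trigger
    """
    model.eval()
    with torch.no_grad():
        # Blend the trigger into every image through the soft mask.
        stamped = (1 - mask) * images + mask * trigger
        preds = model(stamped).argmax(dim=1)
    # Attack success rate on images the trigger was not optimized for.
    return (preds == target_class).float().mean().item()
```

Under these assumptions, a defense could flag a model as Trojaned when any recovered trigger's score exceeds a chosen threshold.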
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Ran_He1
Submission Number: 2472