MDTD: A Multi-Domain Trojan Detector for Deep Neural Networks

Arezoo Rajabi, Surudhi Asokraj, Fengqing Jiang, Luyao Niu, Bhaskar Ramasubramanian, James A. Ritcey, Radha Poovendran

Published: 2023, Last Modified: 03 Feb 2024CCS 2023Readers: Everyone

Abstract: Machine learning models that use deep neural networks (DNNs) are vulnerable to backdoor attacks. An adversary carrying out a backdoor attack embeds a predefined perturbation called a trigger into a small subset of input samples and trains the DNN such that the presence of the trigger in the input results in an adversary-desired output class. Such adversarial retraining however needs to ensure that outputs for inputs without the trigger remain unaffected and provide high classification accuracy on clean samples. Existing defenses against backdoor attacks are computationally expensive, and their success has been demonstrated primarily on image-based inputs. The increasing popularity of deploying pretrained DNNs to reduce costs of re/training large models makes defense mechanisms that aim to detect 'suspicious' input samples preferable. In this paper, we propose MDTD, a Multi-Domain Trojan Detector for DNNs, which detects inputs containing a Trojan trigger at testing time. MDTD does not require knowledge of trigger-embedding strategy of the attacker and can be applied to a pretrained DNN model with image, audio, or graph-based inputs. MDTD leverages an insight that input samples containing a Trojan trigger are located relatively farther away from a decision boundary than clean samples. MDTD estimates the distance to a decision boundary using adversarial learning methods and uses this distance to infer whether a test-time input sample is Trojaned or not. We evaluate MDTD against state-of-the-art Trojan detection methods across five widely used image-based datasets- CIFAR100, CIFAR10, GTSRB, SVHN, and Flowers102, four graph-based datasets- AIDS, WinMal, Toxicant, and COLLAB, and the SpeechCommand audio dataset. Our results show that MDTD effectively identifies samples that contain different types of Trojan triggers. We further evaluate MDTD against adaptive attacks where an adversary trains a robust DNN to increase (decrease) distance of benign (Trojan) inputs from a decision boundary. Although such training by the adversary reduces the detection rate of MDTD, this is accomplished at the expense of reducing classification accuracy or adversary success rate, thus rendering the resulting model unfit for use.

0 Replies