Model X-ray: Detecting Backdoored Models via Decision Boundary

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, Venue: MM2024 Poster, Readers: Everyone, License: CC BY 4.0
Abstract: Backdoor attacks pose a significant security threat to deep neural networks (DNNs): a backdoored model operates normally on clean inputs but produces manipulated predictions when a specific trigger pattern is present. In this paper, we consider a practical post-training backdoor defense scenario, in which the defender aims to evaluate whether a trained model has been compromised by a backdoor attack. Existing post-training backdoor detection approaches often assume that the defender has knowledge of the attack, access to the model's logit outputs, and knowledge of the model parameters, which limits their use in practical scenarios. In contrast, our approach functions as a lightweight diagnostic scanning tool that operates in conjunction with other defense methods, assisting in defense pipelines. We begin by presenting an intriguing observation: the decision boundary of a backdoored model exhibits a greater degree of closeness than that of a clean model. Moreover, if only a single label is infected, a larger portion of the regions is dominated by the attacked label. Leveraging this observation, and drawing an analogy to X-rays in disease diagnosis, we propose Model X-ray. This novel backdoor detection approach is based on the analysis of illustrated two-dimensional (2D) decision boundaries, offering interpretability and visualization. Model X-ray can not only identify whether the target model is infected but also determine the attacked target label under the all-to-one attack strategy. Importantly, it accomplishes this using only the predicted hard labels of clean inputs, without any assumptions about the attack or prior knowledge of the model's training details. Extensive experiments demonstrate that Model X-ray is effective and efficient across diverse backdoor attacks, datasets, and architectures.
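The abstract does not spell out how the 2D decision boundaries are built or scored, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that a plane is spanned by three clean inputs, that the model is reachable only through a hard-label function (here the hypothetical `predict_labels`), and that the area share of each label and the entropy of those shares can serve as stand-in statistics. All function names, the grid resolution, and the two toy classifiers are illustrative assumptions.

```python
import numpy as np

def boundary_plane_labels(predict_labels, x0, x1, x2, resolution=50, margin=0.2):
    """Query hard labels on the 2D plane spanned by three flattened clean inputs.

    Grid point (a, b) corresponds to the input x0 + a*(x1 - x0) + b*(x2 - x0).
    `predict_labels` is a hypothetical black-box returning one integer label per row.
    """
    coords = np.linspace(-margin, 1.0 + margin, resolution)
    a, b = np.meshgrid(coords, coords, indexing="ij")
    points = x0 + a[..., None] * (x1 - x0) + b[..., None] * (x2 - x0)
    labels = predict_labels(points.reshape(-1, x0.size))
    return np.asarray(labels).reshape(resolution, resolution)

def label_area_stats(label_grid, num_classes):
    """Return per-label area fractions, their entropy, and the dominant label."""
    counts = np.bincount(label_grid.ravel(), minlength=num_classes)
    fractions = counts / counts.sum()
    nonzero = fractions[fractions > 0]
    entropy = float(-np.sum(nonzero * np.log(nonzero)))
    return fractions, entropy, int(np.argmax(fractions))

# Toy stand-ins for real models (assumptions for illustration only).
centroids = np.eye(3, 4)          # three class centroids in a 4-dimensional input space
centroids[0] = 0.0                # class 0 sits at the origin

def toy_clean(x):
    # Nearest-centroid classifier: each class owns a comparable region.
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

def toy_dominated(x):
    # Label 0 everywhere except in small balls around the other centroids,
    # mimicking a plane dominated by a single label.
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    nearest = np.argmin(dists, axis=1)
    return np.where(dists.min(axis=1) < 0.3, nearest, 0)

clean_grid = boundary_plane_labels(toy_clean, *centroids)
dominated_grid = boundary_plane_labels(toy_dominated, *centroids)
clean_frac, clean_entropy, clean_top = label_area_stats(clean_grid, 3)
dom_frac, dom_entropy, dom_top = label_area_stats(dominated_grid, 3)
```

In this toy setting the single-label-dominated classifier yields a lower entropy and a larger top-label share than the nearest-centroid one. Whether a comparable gap separates real backdoored and clean models, and which threshold to apply, is an empirical question that this sketch does not answer.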
Primary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: This work introduces Model X-ray, a lightweight method for detecting backdoor attacks in deep neural networks for image classification. Inspired by X-rays in disease diagnosis, Model X-ray identifies backdoored models using only clean inputs. It demonstrates effectiveness and efficiency across diverse datasets and architectures.
Supplementary Material: zip
Submission Number: 2413