On the Existence of a Trojaned Twin ModelDownload PDF

Published: 01 Feb 2023, Last Modified: 13 Feb 2023Submitted to ICLR 2023Readers: Everyone
Keywords: Backdoor Attack, Trojan Attack
TL;DR: A mathematical model for backdoor attack, show the existence of a trojaned twin model of a clean model
Abstract: We study the Trojan Attack problem, where malicious attackers sabotage deep neural network models with poisoned training data. In most existing works, the effectiveness of the attack is largely overlooked; many attacks can be ineffective or inefficient for certain training schemes, e.g., adversarial training. In this paper, we adopt a novel perspective and look into the quantitative relationship between a clean model and its Trojaned counterpart. We formulate a successful attack using classic machine learning language. Under mild assumptions, we show theoretically that there exists a Trojaned model, named Trojaned Twin, that is very close to the clean model. This attack can be achieved by simply using a universal Trojan trigger intrinsic to the data distribution. This has powerful implications in practice; the Trojaned twin model has enhanced attack efficacy and strong resiliency against detection. Empirically, we show that our method achieves consistent attack efficacy across different training schemes, including the challenging adversarial training scheme. Furthermore, this Trojaned twin model is robust against SoTA detection methods
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Social Aspects of Machine Learning (eg, AI safety, fairness, privacy, interpretability, human-AI interaction, ethics)
12 Replies

Loading