BaDLoss: Backdoor Detection via Loss Dynamics

23 Sept 2023 (modified: 11 Feb 2024), Submitted to ICLR 2024
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: data poisoning, backdoors
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: BaDLoss is a new detection method for data poisoning backdoors using per-example loss dynamics to detect anomalies. It works well in standard evaluations as well as a novel, more realistic evaluation.
Abstract: Backdoor attacks often inject synthetic features into a training dataset. Images carrying these synthetic features often exhibit starkly different training dynamics from natural images. Previous work has identified this phenomenon, claiming that backdoors are outliers (Hayase et al. 2021) or particularly strong features (Khaddaj et al. 2023), and are consequently harder or easier, respectively, to learn than regular examples. We instead identify backdoors as having different, anomalous training dynamics. With this insight, we present BaDLoss, a robust backdoor detection method. BaDLoss injects specially chosen probes that model anomalous training dynamics and tracks the loss trajectory of each example in the dataset, enabling the identification of unknown backdoors in the training set. Our method transfers zero-shot to novel backdoor attacks without prior knowledge of the attack. Additionally, BaDLoss can detect multiple concurrent attacks, setting it apart from most existing approaches. By removing identified examples and retraining, BaDLoss eliminates the model's vulnerability to most attacks, far more effectively than previous defenses.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7630
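
Illustrative sketch (not the authors' implementation): the abstract describes injecting probes with anomalous training dynamics, tracking each example's loss trajectory, and flagging examples for removal. The Python below shows one way such a detector could look, under assumptions that the abstract does not state: per-example losses are already recorded as an (examples x epochs) array, a small set of trusted clean examples is available, and an example is flagged when most of its k nearest reference trajectories (Euclidean distance) are probes. The function name, the parameters k and threshold, and the synthetic data are all hypothetical.

```python
import numpy as np

def flag_by_probe_neighbors(trajectories, probe_idx, clean_idx, k=5, threshold=0.5):
    """Flag examples whose loss trajectories resemble the injected probes.

    trajectories: (n_examples, n_epochs) array of per-example losses.
    probe_idx:    indices of injected probe examples (assumed anomalous).
    clean_idx:    indices of reference examples assumed to be clean.
    Returns a boolean mask of shape (n_examples,), True where flagged.
    """
    trajectories = np.asarray(trajectories, dtype=float)
    ref_idx = np.concatenate([probe_idx, clean_idx])
    ref_is_probe = np.concatenate([np.ones(len(probe_idx)), np.zeros(len(clean_idx))])
    ref = trajectories[ref_idx]
    # Pairwise Euclidean distances between every example and every reference.
    dists = np.linalg.norm(trajectories[:, None, :] - ref[None, :, :], axis=2)
    # Indices of the k closest reference trajectories for each example.
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Fraction of those neighbours that are probes.
    probe_fraction = ref_is_probe[nearest].mean(axis=1)
    return probe_fraction > threshold

# Synthetic demo (made-up dynamics): clean losses decay slowly,
# backdoored and probe losses decay quickly.
rng = np.random.default_rng(0)
epochs = np.arange(20)
slow = 2.0 * np.exp(-0.1 * epochs)
fast = 2.0 * np.exp(-1.0 * epochs)
n_clean, n_backdoor, n_probe = 100, 10, 5
trajectories = np.vstack([
    slow + 0.01 * rng.standard_normal((n_clean, epochs.size)),
    fast + 0.01 * rng.standard_normal((n_backdoor, epochs.size)),
    fast + 0.01 * rng.standard_normal((n_probe, epochs.size)),
])
clean_idx = np.arange(20)  # trusted clean subset
probe_idx = np.arange(n_clean + n_backdoor, n_clean + n_backdoor + n_probe)
flags = flag_by_probe_neighbors(trajectories, probe_idx, clean_idx)
keep_idx = np.flatnonzero(~flags)  # examples retained for retraining
```

In this toy setting the retained indices in keep_idx would form the training set for the retraining step the abstract mentions. Real loss trajectories are noisier, so the distance measure, k, and threshold would need tuning.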