Adversarially Robust Anti-Backdoor Learning

Published: 18 Oct 2024, Last Modified: 29 Sept 2024
Proc. of the 17th ACM Workshop on Artificial Intelligence and Security (AISec), co-located with the 31st ACM Conference on Computer and Communications Security (CCS)
License: CC BY 4.0
Abstract: Defending against data poisoning-based backdoors at training time is notoriously difficult due to the wide range of attack variants. Recent attacks use perturbations/triggers subtly entangled with the benign features, impeding the separation of poisonous and clean training samples as required for learning a clean model. In this paper, however, we demonstrate that such a strict separation is not necessarily needed in practice. Our method, A-ABL, is rooted in the observation that considering training-time defenses against adversarial examples and backdoors simultaneously relaxes the requirements for each task individually. First, we learn a naive model on the entire training data and use it to derive adversarial examples for each sample. Second, we remove those training samples for which the adversarial perturbation (budget) was insufficient to flip the prediction, following the rationale that these are related to a profoundly embedded shortcut to the backdoor’s target class. Finally, we adversarially train a model on the remaining data. Training with at least the same perturbation budget used in the first step pushes the remaining poisonous samples away from the backdoor target, preventing backdoor injection while also hardening the model against adversarial examples. This way, our method removes backdoors on par with more complex anti-backdoor learning techniques, additionally yielding an adversarially robust model.
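
To make the three-step pipeline concrete, the following PyTorch sketch illustrates one way it could be implemented. All names (`a_abl`, `pgd_perturb`, `train`), the choice of a PGD attack, the budget `eps`, and the training hyperparameters are illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical sketch of the three-step pipeline described in the abstract:
# (1) train a naive model, (2) filter samples that resist the perturbation budget,
# (3) adversarially train on the remaining data. Attack, budget, and schedules are assumptions.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


def pgd_perturb(model, x, y, eps, alpha, steps):
    """Untargeted L-infinity PGD perturbation of x within radius eps."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (x + delta).detach()


def train(model, loader, epochs, device, adversarial=False, eps=None, alpha=None, steps=None):
    """Standard or adversarial training with cross-entropy loss."""
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            if adversarial:
                x = pgd_perturb(model, x, y, eps, alpha, steps)
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()


def a_abl(make_model, train_set, eps=8 / 255, alpha=2 / 255, steps=10,
          epochs=10, device="cpu"):
    # Step 1: train a naive model on the full, possibly poisoned, training data.
    naive = make_model().to(device)
    train(naive, DataLoader(train_set, batch_size=128, shuffle=True), epochs, device)

    # Step 2: keep only samples whose prediction flips under the budget eps;
    # samples resisting the perturbation are treated as backdoor shortcuts and dropped.
    naive.eval()
    xs, ys = [], []
    for x, y in DataLoader(train_set, batch_size=128):
        x, y = x.to(device), y.to(device)
        x_adv = pgd_perturb(naive, x, y, eps, alpha, steps)
        flipped = naive(x_adv).argmax(dim=1) != y
        xs.append(x[flipped].cpu())
        ys.append(y[flipped].cpu())
    filtered = TensorDataset(torch.cat(xs), torch.cat(ys))

    # Step 3: adversarially train a fresh model on the filtered data, using at
    # least the same budget eps as in step 2.
    robust = make_model().to(device)
    train(robust, DataLoader(filtered, batch_size=128, shuffle=True), epochs, device,
          adversarial=True, eps=eps, alpha=alpha, steps=steps)
    return robust
```

In this sketch, the same `eps` governs both the filtering in step 2 and the adversarial training in step 3, mirroring the abstract's requirement that the final training use at least the budget applied during filtering.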