Poisoned classifiers are not only backdoored, they are fundamentally broken

Mingjie Sun; Siddhant Agarwal; J Zico Kolter

28 Sept 2020 (modified: 12 Oct 2025) | ICLR 2021 Conference Blind Submission | Readers: Everyone

Keywords: Backdoor Attacks, Denoised Smoothing, Perceptually-Aligned Gradients

Abstract: Under a commonly studied “backdoor” poisoning attack against classification models, an attacker adds a small “trigger” to a subset of the training data, such that the presence of this trigger at test time causes the classifier to always predict some target class. It is often implicitly assumed that the poisoned classifier is vulnerable exclusively to the adversary who possesses the trigger. In this paper, we show empirically that this view of backdoored classifiers is fundamentally incorrect. We demonstrate that anyone with access to the classifier, even without access to any original training data or trigger, can construct several alternative triggers that are as effective as, or more effective than, the original trigger at eliciting the target class at test time. We construct these alternative triggers by first generating adversarial examples for a smoothed version of the classifier, created with a recently proposed procedure called Denoised Smoothing, and then extracting colors or cropped portions of the resulting adversarial images. We demonstrate the effectiveness of our attack through extensive experiments on ImageNet and TrojAI datasets, including a user study which demonstrates that our method allows users to easily determine the existence of such backdoors in existing poisoned classifiers. Furthermore, we demonstrate that our alternative triggers can in fact look entirely different from the original trigger, highlighting that the backdoor actually learned by the classifier differs substantially from the trigger image itself. Thus, we argue that there is no such thing as a “secret” backdoor in poisoned classifiers: poisoning a classifier invites attacks not just by the party that possesses the trigger, but from anyone with access to the classifier.
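The two-stage procedure in the abstract (attack a smoothed copy of the classifier, then pull a patch and a color out of the adversarial image) can be illustrated with a toy NumPy sketch. This is not the authors' implementation and rests on loudly stated assumptions: the classifier is a hypothetical linear model on a flattened image, the trained denoiser that Denoised Smoothing places in front of the classifier is replaced by the identity map (which reduces it to plain Gaussian smoothing), and choosing the crop as the window with the largest total perturbation is an illustrative heuristic of our own. The function names `smoothed_attack` and `extract_patch_and_color` are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def smoothed_attack(x, W, b, target, sigma=0.25, eps=2.0, steps=20,
                    step_size=0.5, n_samples=32, seed=0):
    """Targeted L2 attack on a Gaussian-smoothed linear classifier.

    Hypothetical toy setup: f(z) = W @ z + b on the flattened image z, and the
    denoiser is assumed to be the identity map. The objective is the Monte
    Carlo average of log p_target(x + delta), delta ~ N(0, sigma^2 I).
    """
    rng = np.random.default_rng(seed)
    x0 = x.ravel().astype(float)
    x_adv = x0.copy()
    for _ in range(steps):
        noise = rng.normal(0.0, sigma, size=(n_samples, x0.size))
        p = softmax((x_adv + noise) @ W.T + b)        # (n_samples, classes)
        # For a linear model, grad_z log softmax(Wz + b)[t] = W[t] - p @ W;
        # average it over the noise samples to get the smoothed gradient.
        grad = (W[target] - p @ W).mean(axis=0)
        norm = np.linalg.norm(grad)
        if norm == 0:
            break
        x_adv = x_adv + step_size * grad / norm        # normalized ascent step
        delta = x_adv - x0
        d_norm = np.linalg.norm(delta)
        if d_norm > eps:                                # project onto L2 ball
            delta *= eps / d_norm
        x_adv = np.clip(x0 + delta, 0.0, 1.0)           # keep a valid image
    return x_adv.reshape(x.shape)

def extract_patch_and_color(x, x_adv, size):
    """Crop the size x size window of x_adv whose pixels changed the most
    (hypothetical heuristic) and return it with its mean color."""
    mag = np.abs(x_adv - x).sum(axis=-1)                # (H, W) change per pixel
    height, width = mag.shape
    best, best_ij = -1.0, (0, 0)
    for i in range(height - size + 1):
        for j in range(width - size + 1):
            s = mag[i:i + size, j:j + size].sum()
            if s > best:
                best, best_ij = s, (i, j)
    i, j = best_ij
    patch = x_adv[i:i + size, j:j + size]
    color = patch.reshape(-1, x.shape[-1]).mean(axis=0)
    return patch, color

# Usage on random toy data (8x8 RGB image, 4 classes).
rng = np.random.default_rng(1)
x = rng.uniform(0.2, 0.8, size=(8, 8, 3))
W = rng.normal(size=(4, x.size))
b = np.zeros(4)
x_adv = smoothed_attack(x, W, b, target=2)
patch, color = extract_patch_and_color(x, x_adv, size=3)
```

The patch or the single color would then be pasted onto fresh inputs as a candidate trigger. Attacking the smoothed model rather than the raw classifier is what the abstract credits for producing adversarial images from which usable triggers can be extracted; the sketch only mirrors the mechanics, not the paper's results.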

One-sentence Summary: We show that backdoored classifiers can be easily attacked without access to the original trigger, by constructing alternative triggers that are as effective as, or even more effective than, the original one.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Community Implementations: [4 code implementations](https://www.catalyzex.com/paper/poisoned-classifiers-are-not-only-backdoored/code)

Reviewed Version (pdf): https://openreview.net/references/pdf?id=dJGbvSguQO