General Cross-Attack Backdoor Detector Based on Disturbance Immunity of Triggers

12 Sept 2025 (modified: 14 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Backdoor Detection; Backdoor Attack; Backdoor Defense; Cross-Attack
Abstract: Backdoor attacks aim to manipulate the behavior of DNNs under trigger-activated conditions, and data poisoning is the standard approach for embedding triggers into victim models. Current backdoor detectors struggle to separate trigger-injected samples from the poisoned dataset, and suffer from two dilemmas. (1) Modern backdoor features are usually highly coupled with benign features, yet existing detectors are almost exclusively pixel-based, which critically hinders the recognition of backdoor features. (2) Lacking prior knowledge of the poisoned-sample distribution, most detectors are restricted to approaches akin to unsupervised clustering; they therefore rely heavily on sufficient clean samples and weak hand-crafted priors to search for poisoned samples, and generalize poorly across attacks. This paper introduces a brand-new perspective that reformulates the attackers' objective, ***i.e., backdoor attacks lead the victim model to classify the trigger, even when disturbed by clean images, into the target label***, to identify what is common across attacks. Specifically, we propose the concept of ***Disturbance Immunity*** of triggers and ***theoretically demonstrate that benign and backdoor features exhibit significant classification-probability discrepancies under perturbations of varying clean-image classes and intensities***. A few known conventional attack patterns are then applied to label the poisoned dataset, and the labeled dataset is perturbed in the above manner to drive the detector to learn the Disturbance Immunity of triggers. Traditional unsupervised clustering-based detection is thus transformed into a simple labeled binary classification task. ***To our knowledge, no existing method performs detection based on such direct commonality transfer, nor recasts the feature-separation task in a labeled-conversion detection framework.*** Finally, we train and present an effective ***G***eneral ***C***ross-attack ***B***ackdoor ***D***etector (***GCBD***). With few clean images $(\leq 10)$, GCBD achieves ***S***tate-***O***f-***T***he-***A***rt (***SOTA***) detection performance with satisfactory generalization against various SOTA attacks. Additionally, GCBD supports direct toxicity detection on samples unseen during training, as demonstrated by a more challenging test-time validation protocol. Our code will be released soon.
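To make the Disturbance Immunity idea concrete, below is a minimal sketch of the perturbation test the abstract describes: a trigger-carrying sample should keep its (target) prediction when blended with clean images of varying classes and intensities, while a benign sample's prediction drifts. This is not the authors' released GCBD code; the blending rule, intensity grid, and stability score are illustrative assumptions for a generic PyTorch classifier.

```python
# Illustrative sketch of the disturbance-immunity test (assumed blending rule,
# not the paper's official implementation).
import torch

@torch.no_grad()
def disturbance_immunity_score(model, x, clean_images, intensities=(0.2, 0.4, 0.6)):
    """Fraction of perturbed variants whose prediction matches the original.

    x:            suspicious input, shape (C, H, W)
    clean_images: a few clean samples (<= 10), shape (N, C, H, W)
    intensities:  assumed blending strengths alpha; x' = (1 - alpha) * x + alpha * c
    """
    model.eval()
    base_pred = model(x.unsqueeze(0)).argmax(dim=1)          # prediction on the raw sample
    matches, total = 0, 0
    for alpha in intensities:
        # Blend the suspicious sample with every clean image at this intensity.
        blended = (1.0 - alpha) * x.unsqueeze(0) + alpha * clean_images  # (N, C, H, W)
        preds = model(blended).argmax(dim=1)
        matches += (preds == base_pred).sum().item()
        total += preds.numel()
    # A score near 1.0 indicates a disturbance-immune (likely trigger-injected) sample.
    return matches / total
```

Under this framing, such scores (or the underlying probability responses) computed over samples labeled via a few known conventional attack patterns could serve as features for the simple binary classifier the abstract mentions.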
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4476