Fortify the Guardian, Not the Treasure: Resilient Adversarial Detectors

TMLR Paper2573 Authors

23 Apr 2024 (modified: 27 Sept 2024). Rejected by TMLR. License: CC BY 4.0

Abstract: This paper presents RADAR (Robust Adversarial Detection via Adversarial Retraining), an approach designed to make adversarial detectors more robust to adaptive attacks while maintaining classifier performance. An adaptive attack is one in which the attacker is aware of the defenses and adapts their strategy accordingly. Our method uses adversarial training to strengthen the detector's ability to detect attacks without compromising the classifier's clean accuracy. During training, we add to the dataset adversarial examples that were optimized to fool both the classifier and the adversarial detector, so that the detector learns to recognize such attacks.
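The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes toy linear-logistic models in place of deep networks, an L-infinity PGD attacker, and a combined objective of the form "classifier loss minus detector loss", none of which are specified on this page. All names (`adaptive_attack`, `detector_sgd_step`, `score`) are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, b, x):
    """Logistic output sigmoid(w . x + b) of a toy linear model (assumption)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(v):
    return (v > 0) - (v < 0)

def adaptive_attack(x0, y, clf, det, eps, alpha, steps):
    """L-inf PGD that raises the classifier loss on the true label y
    while lowering the detector's 'adversarial' score."""
    (wc, bc), (wd, bd) = clf, det
    x = list(x0)
    for _ in range(steps):
        p = score(wc, bc, x)  # classifier P(class 1 | x)
        q = score(wd, bd, x)  # detector P(adversarial | x)
        # Gradient w.r.t. x of BCE(p, y) - BCE(q, 0) for linear-logistic models.
        grad = [(p - y) * c - q * d for c, d in zip(wc, wd)]
        # Signed ascent step, then projection back into the eps-ball around x0.
        x = [min(max(xi + alpha * sign(g), x0i - eps), x0i + eps)
             for xi, g, x0i in zip(x, grad, x0)]
    return x

def detector_sgd_step(det, x, target, lr):
    """One SGD step on the detector's BCE loss (target 1 = adversarial)."""
    w, b = det
    err = score(w, b, x) - target
    return [wi - lr * err * xi for wi, xi in zip(w, x)], b - lr * err

# Toy usage: craft an adaptive example, then retrain the detector on it.
clf = ([1.0, 0.0], 0.0)   # illustrative classifier weights and bias
det = ([0.0, 1.0], 0.0)   # illustrative detector weights and bias
x_adv = adaptive_attack([0.5, 0.5], 1, clf, det, eps=0.3, alpha=0.1, steps=5)
det_retrained = detector_sgd_step(det, x_adv, target=1, lr=0.5)
```

In this sketch only the detector is updated during retraining and the classifier weights are left untouched, which is one simple way to keep clean accuracy unchanged; whether the paper does exactly this is not stated in the abstract.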

Submission Length: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission:
* Ablation study has been added.
* Comparison to other detectors has been added.
* ImageNet1k subset has been added.
* Classification accuracy table has been added.
* Definition of K in Eq. 1 has been added.
* Layout of section 5 has been fixed.
* Section 3.2 clarification: Improved explanation of the optimization objective for attack optimization.
* Ablation study on clean training: Added an ablation study for the initial clean training phase (Figure 9).
* Accuracy computation: Expanded explanation of accuracy calculation and sources of accuracy drop in Table 4.
* Accuracy against adaptive attacks: Included classification accuracies for all classifiers across all epsilon values (Figure 9).
* Generalization on transfer attacks: Added Figure 2 to show the detector's generalization to attacks from other surrogate models.

Assigned Action Editor: ~W_Ronny_Huang1

Submission Number: 2573
