Carefully Blending Adversarial Training and Purification Improves Adversarial Robustness

14 May 2024 (modified: 06 Nov 2024) · Submitted to NeurIPS 2024 · CC BY-NC-ND 4.0
Keywords: adversarial robustness, adversarial training, adversarial purification, generative purification, internal representation
TL;DR: We propose a novel adversarial defence for image classifiers, merging adversarial training and purification: the internal representation of an adversarially-trained classifier is mapped to a distribution of denoised reconstructions to be classified.
Abstract: In this work, we propose a novel adversarial defence mechanism for image classification - *CARSO* - blending the paradigms of *adversarial training* and *adversarial purification* in a synergistic, robustness-enhancing way. The method builds upon an adversarially-trained classifier, and learns to map the *internal representation* associated with a potentially perturbed input onto a distribution of tentative *clean* reconstructions. Multiple samples from this distribution are classified by the same adversarially-trained model, and an aggregation of its outputs finally constitutes the *robust prediction* of interest. Experimental evaluation, using a well-established benchmark of strong adaptive attacks across different image datasets, shows that *CARSO* can defend itself against adaptive *end-to-end* *white-box* attacks devised for stochastic defences. At the cost of a modest *clean*-accuracy penalty, our method improves the *state of the art* in CIFAR-10, CIFAR-100, and TinyImageNet-200 $\ell_\infty$ robust classification accuracy against AutoAttack by a significant margin.
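
The inference pipeline described in the abstract can be rendered as a minimal sketch. This is purely illustrative and not the authors' implementation: the `return_features` interface, the `purifier.sample` call, the number of samples, and logit averaging as the aggregation rule are all assumptions, since the abstract does not specify these details.

```python
import torch

@torch.no_grad()
def carso_predict(x, classifier, purifier, n_samples=8):
    """Illustrative sketch of a CARSO-style robust prediction.

    Assumptions (not specified by the paper's abstract):
    - `classifier` is an adversarially-trained model that can also expose
      its internal representation via a hypothetical `return_features` flag;
    - `purifier` is a stochastic generative model with a hypothetical
      `sample` method mapping that representation to a tentative clean
      reconstruction;
    - the aggregation over samples is a plain average of logits.
    """
    # Extract the internal representation of the (possibly perturbed) input.
    _, representation = classifier(x, return_features=True)

    # Draw several tentative clean reconstructions and classify each one
    # with the same adversarially-trained model.
    logits_sum = 0.0
    for _ in range(n_samples):
        x_clean = purifier.sample(representation)  # stochastic reconstruction
        sample_logits, _ = classifier(x_clean, return_features=True)
        logits_sum = logits_sum + sample_logits

    # Aggregate the per-sample outputs into the final robust prediction.
    return (logits_sum / n_samples).argmax(dim=-1)
```

Averaging over several stochastic reconstructions is one plausible reading of "an aggregation of its outputs"; majority voting over per-sample predictions would be an equally reasonable alternative under the same description.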
Supplementary Material: zip
Primary Area: Safety in machine learning
Submission Number: 11347