CARSO: Blending Adversarial Training and Purification Improves Adversarial Robustness

Emanuele Ballarin; Alessio Ansuini; Luca Bortolussi

21 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.

Primary Area: societal considerations including fairness, safety, privacy

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: adversarial robustness, adversarial training, adversarial purification, generative purification, internal representation

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: We propose a novel adversarial defence for image classifiers, merging adversarial training and purification: the internal representation of an adversarially-trained classifier is mapped to a distribution of denoised reconstructions to be classified.

Abstract: In this work, we propose CARSO, a novel adversarial defence mechanism for image classification that blends the paradigms of *adversarial training* and *adversarial purification* in a mutually beneficial, robustness-enhancing way. The method builds upon an adversarially-trained classifier and learns to map its *internal representation*, associated with a potentially perturbed input, onto a distribution of tentative clean reconstructions. Multiple samples from this distribution are classified by the adversarially-trained model, and an aggregation of its outputs constitutes the *robust prediction* of interest. Experimental evaluation on a well-established benchmark of varied, strong adaptive attacks, across different image datasets and classifier architectures, shows that CARSO is able to defend itself against foreseen and unforeseen threats, including adaptive *end-to-end* attacks devised for stochastic defences. At the cost of a tolerable reduction in *clean* accuracy, our method improves by a significant margin on the *state of the art* for CIFAR-10 and CIFAR-100 $\ell_\infty$ robust classification accuracy against AutoAttack.
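
The inference pipeline described in the abstract (extract the internal representation, sample several reconstructions, classify each, aggregate) can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation. The names `robust_predict`, `toy_classify`, and `toy_sample_reconstruction` are hypothetical, the toy classifier and purifier are stand-ins for trained networks, and averaging softmax probabilities is an assumed aggregation rule, since the abstract does not specify one.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# A classifier returns (logits, internal representation) for an input.
Classifier = Callable[[Vector], Tuple[Vector, Vector]]
# A purifier draws one reconstruction conditioned on a representation.
Purifier = Callable[[Vector, random.Random], Vector]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def robust_predict(
    x: Vector,
    classify: Classifier,
    sample_reconstruction: Purifier,
    n_samples: int,
    rng: random.Random,
) -> Tuple[int, Vector]:
    """Classify several sampled reconstructions and aggregate the outputs."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    # 1. Internal representation of the (possibly perturbed) input.
    _, representation = classify(x)
    mean_probs: Vector = []
    for _ in range(n_samples):
        # 2. Sample a tentative clean reconstruction.
        x_rec = sample_reconstruction(representation, rng)
        # 3. Classify it with the same classifier.
        logits, _ = classify(x_rec)
        probs = softmax(logits)
        if not mean_probs:
            mean_probs = [0.0] * len(probs)
        # 4. Aggregate (ASSUMPTION: plain average of class probabilities).
        mean_probs = [a + p / n_samples for a, p in zip(mean_probs, probs)]
    label = max(range(len(mean_probs)), key=mean_probs.__getitem__)
    return label, mean_probs


# Toy stand-ins (hypothetical): a two-class "classifier" whose internal
# representation is the input itself, and a "purifier" that adds noise.
def toy_classify(x: Vector) -> Tuple[Vector, Vector]:
    s = sum(x)
    return [s, -s], list(x)


def toy_sample_reconstruction(representation: Vector, rng: random.Random) -> Vector:
    return [r + rng.gauss(0.0, 0.1) for r in representation]


label, probs = robust_predict(
    [1.0, 2.0, 0.5],
    toy_classify,
    toy_sample_reconstruction,
    n_samples=8,
    rng=random.Random(0),
)
```

In this sketch the classifier is used twice, once to produce the representation and once to label each reconstruction, which mirrors the abstract's statement that the samples are classified by the adversarially-trained model itself.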

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

Supplementary Material: zip

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 3670
