Adversarial Defense using Targeted Manifold Manipulation

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: pdf
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Adversarial Defense, Backdoor, Deep Learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Adversarial attacks on deep models are often guaranteed to find a small, innocuous perturbation that easily alters the class label of a test input. We propose a novel Targeted Manifold Manipulation approach that, during such attacks, directs gradients away from the genuine data manifold and towards carefully planted trapdoors. The trapdoors are assigned an additional class label (Trapclass), making attacks that fall into them easily identifiable. While low-perturbation-budget attacks necessarily end up in the trapdoors, high-perturbation-budget attacks may escape, but only by landing far from the data manifold. Since our manifold manipulation is enforced only locally, we show that such out-of-distribution inputs can be easily detected by noting the absence of trapdoors around them. Our detection algorithm avoids learning a separate model for attack detection and thus remains semantically aligned with the original classifier. Further, because we manipulate the adversarial distribution, we avoid the fundamental difficulty of overlapping clean and attack distributions that afflicts usual, unmanipulated models. We evaluate the proposed defense with six state-of-the-art adversarial attacks on four well-known image datasets. Our results show that the method detects ~99% of attacks without a significant drop in clean accuracy, while also remaining robust to semantic-preserving, non-attack perturbations.
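A minimal sketch of the two-stage detection logic the abstract describes, written in PyTorch. This is an illustration under stated assumptions, not the authors' implementation: the classifier `model` is assumed to have been trained with an extra trap class, and `TRAP_CLASS`, `NUM_PROBES`, and `PROBE_RADIUS` are hypothetical names and values.

    import torch

    TRAP_CLASS = 10       # hypothetical index of the planted Trapclass
    NUM_PROBES = 32       # number of random probes around the input
    PROBE_RADIUS = 0.05   # small perturbation radius for the probes

    def detect(model: torch.nn.Module, x: torch.Tensor) -> str:
        """Label x as 'attack', 'clean', or 'ood' per the abstract's logic."""
        model.eval()
        with torch.no_grad():
            pred = model(x.unsqueeze(0)).argmax(dim=1).item()
            # Low-budget attacks are steered into the trapdoor class itself.
            if pred == TRAP_CLASS:
                return "attack"
            # Probe the local neighbourhood: near the genuine manifold some
            # probes should fall into the planted trapdoors; a high-budget
            # attack that escaped lands far away, where no trapdoors exist.
            noise = PROBE_RADIUS * torch.randn(NUM_PROBES, *x.shape)
            probe_preds = model(x.unsqueeze(0) + noise).argmax(dim=1)
            if (probe_preds == TRAP_CLASS).any():
                return "clean"
            return "ood"  # flagged as an out-of-distribution (escaped) input

Because the trapdoors are planted only locally around the genuine manifold, the absence of any trap-class hit among the probes serves as the out-of-distribution signal, with no separately trained detector.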
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4256