Interpreting Adversarial Attacks and Defenses using Architectures with Enhanced Interpretability

Akshay G Rao; Chandra Shekar Lakshminarayanan; Arun Rajkumar

25 Sept 2024 (modified: 10 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: adversarial attacks, adversarial defenses, computer vision, deep learning, interpretability

TL;DR: We study how adversarial defense techniques protect models against adversarial attacks by comparing robust models with standard models, using architectures with enhanced interpretability.

Abstract: Adversarial attacks in deep learning pose a significant threat to the integrity and reliability of machine learning models. These attacks intentionally craft perturbations to the input data that, while often imperceptible to humans, can cause the model to make incorrect predictions. This phenomenon exposes vulnerabilities in deep learning systems across various applications, from image recognition to natural language processing. Adversarial training is a popular defense against these attacks, and the research community has become increasingly interested in interpreting robust models and understanding how they defend against attacks. In this work, we capitalize on a network architecture, the Deep Linearly Gated Network (DLGN), which has better interpretation capabilities than regular network architectures. Using this architecture, we interpret robust models trained with PGD adversarial training and compare them with models trained in the standard way. The feature network in this architecture acts as the feature extractor, making it the only medium through which an adversary can attack the model. We therefore use the feature network of a DLGN with fully connected layers to analyze properties such as the alignment of the hyperplanes, the relation between the hyperplanes and PCA, and the sub-network overlap among classes, and we compare these properties between robust and standard models. We also consider a DLGN with CNN layers, for which we qualitatively and quantitatively contrast the gating patterns of robust and standard models. Finally, we use ideas from visualization to understand the representations used by robust and standard models.
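
To make the notion of a PGD adversarial perturbation concrete, the following is a minimal sketch of an L-infinity projected gradient descent attack on a toy logistic-regression classifier. It is not the authors' code or experimental setup: the model, the function names (`predict`, `pgd_attack`), and the values of `eps`, `step`, and `iters` are assumptions chosen purely for illustration.

```python
import math


def score(w, b, x):
    """Linear score w.x + b of a toy binary classifier (assumed model)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Predicted label in {0, 1}."""
    return 1 if score(w, b, x) > 0 else 0


def loss_grad_wrt_input(w, b, x, y):
    """Gradient of the logistic loss with respect to the input x.

    For p = sigmoid(w.x + b) and label y in {0, 1}, dL/dx = (p - y) * w.
    """
    p = 1.0 / (1.0 + math.exp(-score(w, b, x)))
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def pgd_attack(w, b, x, y, eps=0.1, step=0.02, iters=20):
    """L-infinity PGD: repeated signed-gradient ascent on the loss,
    each step followed by projection onto the eps-ball around x."""
    x_adv = list(x)
    for _ in range(iters):
        g = loss_grad_wrt_input(w, b, x_adv, y)
        # Ascent step: move each coordinate in the direction that raises the loss.
        x_adv = [xa + step * sign(gi) for xa, gi in zip(x_adv, g)]
        # Projection: clip each coordinate back into [x_i - eps, x_i + eps].
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


if __name__ == "__main__":
    w, b = [1.0, -2.0], 0.0      # hypothetical classifier weights
    x, y = [0.3, 0.1], 1         # clean input and its true label
    x_adv = pgd_attack(w, b, x, y)
    print(predict(w, b, x), predict(w, b, x_adv))
```

In PGD adversarial training, perturbed inputs of this kind are generated during training and the model is updated on them instead of (or in addition to) the clean inputs; the sketch above covers only the attack step.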

Primary Area: interpretability and explainable AI

Submission Number: 4608
