Keywords: adversarial attacks, adversarial defenses, computer vision, deep learning, interpretability
TL;DR: We aim to understand how adversarial defense techniques protect models against adversarial attacks by comparing robust models with standard models, using architectures with enhanced interpretation capabilities.
Abstract: Adversarial attacks in deep learning represent a significant threat to the integrity and reliability of machine learning models. These attacks involve intentionally crafted perturbations to input data that, while often imperceptible to humans, lead the model to make incorrect predictions. This phenomenon exposes vulnerabilities in deep learning systems across a range of applications, from image recognition to natural language processing. Adversarial training has been a popular defence technique against such attacks, and the research community has become increasingly interested in interpreting robust models and understanding how they defend against attacks.
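As background for the defence analysed in this work, the following is a minimal sketch of L-infinity PGD adversarial example generation in PyTorch; in adversarial training, the classification loss is minimised on these perturbed inputs instead of the clean ones. The model handle, epsilon, step size, and number of steps are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon=8 / 255, alpha=2 / 255, steps=10):
    """Generate L-infinity PGD adversarial examples around clean inputs x."""
    x = x.detach()
    # Random start inside the epsilon-ball, clipped to the valid pixel range.
    x_adv = (x + torch.empty_like(x).uniform_(-epsilon, epsilon)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Gradient ascent on the loss, then projection back into the epsilon-ball.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon).clamp(0, 1)
    return x_adv.detach()
```

In a PGD adversarial training loop, each minibatch (x, y) would be replaced by (pgd_attack(model, x, y), y) before the usual gradient update.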
In this work, we capitalize on Deep Linearly Gated Networks (DLGN), a network architecture with better interpretation capabilities than regular architectures. Using this architecture, we interpret robust models trained with PGD adversarial training and compare them with standard-trained models. Feature networks in these architectures act as feature extractors, making them the only medium through which an adversary can attack the model. We therefore use the feature network of the fully connected variant of this architecture to analyse properties such as hyperplane alignment, the relation of the hyperplanes to PCA directions, and sub-network overlap among classes, and we compare these properties between robust and standard models. We also consider the variant of this architecture with CNN layers, in which we qualitatively and quantitatively contrast gating patterns between robust and standard models. Finally, we use ideas from visualization to understand the representations used by robust and standard models.
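To make the gating structure concrete, here is a minimal sketch of a DLGN-style fully connected model. It assumes soft sigmoid gates produced by a purely linear feature network and a value network that receives a constant input, so the data reaches the output only through the gates; the layer widths, depth, and gate temperature beta are illustrative placeholders rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class DLGNSketch(nn.Module):
    def __init__(self, in_dim=784, width=128, depth=4, num_classes=10, beta=4.0):
        super().__init__()
        dims = [in_dim] + [width] * depth
        # Feature network: purely linear, so every unit defines a hyperplane in input space.
        self.feature = nn.ModuleList(nn.Linear(dims[i], dims[i + 1]) for i in range(depth))
        # Value network: carries no input data; it is modulated only by the gates.
        self.value = nn.ModuleList(nn.Linear(dims[i], dims[i + 1]) for i in range(depth))
        self.head = nn.Linear(width, num_classes)
        self.beta = beta

    def forward(self, x):
        g = x                    # feature-network activations (linear in x)
        v = torch.ones_like(x)   # constant input to the value network
        for f_layer, v_layer in zip(self.feature, self.value):
            g = f_layer(g)                        # hyperplane responses
            gate = torch.sigmoid(self.beta * g)   # soft gates in (0, 1)
            v = v_layer(v) * gate                 # gated, otherwise linear, value path
        return self.head(v)
```

Because g remains a linear function of x at every layer, each gate corresponds to a fixed hyperplane in input space, which is what makes properties such as hyperplane alignment and sub-network overlap directly inspectable.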
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4608