SSD: A Sparse Semantic Defense Against Semantic Adversarial Attacks to Image Classifiers

13 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: adversarial robustness, interpretability, sparse coding
Abstract: Adversarial attacks on image classifiers pose a major threat to machine learning models. However, existing defenses against such attacks have been designed mostly for unrealistic image threat models, such as bounded $\ell_p$-norm image perturbations. In this paper, we focus on defending against more realistic semantic adversarial attacks, which modify semantic image concepts (e.g., adding snow to the scene) that are irrelevant to the underlying classification task (e.g., classifying a dog). Intuitively, a classifier that is robust to semantic attacks should rely only on concepts that are relevant to the task. Therefore, the proposed Sparse Semantic Defense (SSD) uses large language models to build a dictionary of visual concepts that are relevant to a given visual recognition task, and large vision-language models to embed images and concepts into an aligned, shared latent space. Sparse coding is then used to decompose the image embedding as a sparse combination of the text embeddings of relevant concepts plus a residual term that captures irrelevant concepts, including semantic attacks. We provide a theoretical justification for why sparse coding can separate irrelevant semantics from the resulting sparse code. A simple linear classifier is then trained on the sparse code. SSD is robust to semantic attacks by design because it relies only on semantic concepts that are relevant to the task, and it is interpretable by design because those concepts are human-readable. Experiments on ImageNet show that SSD performs favorably against other baselines in terms of robust accuracy under semantic adversarial attacks while maintaining interpretability.
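The decomposition step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the dimensions, the synthetic embeddings, and the choice of an L1-penalized (Lasso) solver are all assumptions; in SSD the image embedding and the concept dictionary would come from a vision-language model such as CLIP, with concepts proposed by an LLM.

```python
# Hypothetical sketch of SSD's sparse-coding step (assumed details, not the paper's code).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

d, k = 512, 32                         # embedding dim, number of task-relevant concepts
D = rng.normal(size=(d, k))            # columns: stand-ins for concept text embeddings
D /= np.linalg.norm(D, axis=0)         # unit-normalize each concept direction

# Synthetic "image embedding": mostly one relevant concept plus small noise
v = 2.0 * D[:, 0] + 0.05 * rng.normal(size=d)

# Sparse code c such that v ≈ D @ c + residual; the L1 penalty encourages
# only a few relevant concepts to be active
lasso = Lasso(alpha=0.001, fit_intercept=False, max_iter=10000)
lasso.fit(D, v)
c = lasso.coef_                        # sparse code over relevant concepts
residual = v - D @ c                   # captures irrelevant semantics (e.g., attacks)

# A simple linear classifier would then be trained on c, discarding the residual.
```

Here the residual absorbs whatever part of the embedding cannot be explained by task-relevant concepts, which is the mechanism the abstract credits for robustness to semantic attacks.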
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 4860