Keywords: concept recovery; sparse recovery; sparse coding; adversarial robustness; vision-language models
Abstract: Adversarial attacks on image classifiers pose a major threat to machine learning models. However, existing defenses against such attacks have been designed mostly for unrealistic image threat models, such as bounded $\ell_p$-norm image perturbations. In this paper, we focus on defending against more realistic *semantic adversarial attacks*, which modify semantic image concepts (e.g., rendering the scene in snow) that are irrelevant to the underlying classification task (e.g., classifying a dog). Intuitively, a classifier that is robust to semantic attacks should rely only on concepts that are relevant to the task. Therefore, the proposed Sparse Semantic Concept Defense (SSCD) uses large language models to build a dictionary of visual concepts relevant to a given visual recognition task, and large vision-language models to embed images and concepts into an aligned, shared latent space. Sparse coding is then used to decompose the image embedding into a sparse combination of the text embeddings of relevant concepts plus a residual term that captures irrelevant concepts, including semantic attacks. We provide a theoretical justification for why sparse coding separates irrelevant semantics from the resulting sparse code. A simple linear classifier is then applied to the sparse code. SSCD is also interpretable by design because it relies on task-relevant visual concepts. Experiments on ImageNet show that SSCD compares favorably with other baselines in robust accuracy against semantic adversarial attacks while maintaining interpretability.
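The sparse-decomposition step described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption for demonstration only: the random stand-in embeddings, the dictionary size, and the use of scikit-learn's `Lasso` as the sparse coder are not taken from the paper, which does not specify its solver here.

```python
# Illustrative sketch, NOT the paper's implementation: decompose a
# (stand-in) image embedding over a dictionary of concept embeddings.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Stand-ins for VLM embeddings (e.g., from a CLIP-style model): rows of D
# are unit-norm text embeddings of task-relevant concepts; x is an image
# embedding. In SSCD these would come from an aligned vision-language model.
n_concepts, dim = 50, 512
D = rng.normal(size=(n_concepts, dim))
D /= np.linalg.norm(D, axis=1, keepdims=True)
x = rng.normal(size=dim)
x /= np.linalg.norm(x)

# Sparse coding: find a sparse code a with x ~ D^T a. The residual
# x - D^T a captures concepts outside the dictionary, which is where
# semantic attacks (e.g., added snow) are meant to land.
coder = Lasso(alpha=0.05, fit_intercept=False, max_iter=10_000)
coder.fit(D.T, x)           # design matrix: columns are concept embeddings
code = coder.coef_          # sparse code over task-relevant concepts
residual = x - D.T @ code   # irrelevant semantics end up here

print(f"nonzero concepts: {np.count_nonzero(code)}")
print(f"residual norm:    {np.linalg.norm(residual):.3f}")
# Per the abstract, a simple linear classifier is then trained on `code`,
# not on the raw embedding x.
```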
Submission Number: 38