State Space Models: A Naturally Robust Alternative to Transformers in Computer Vision

Published: 01 Jul 2025, Last Modified: 01 Jul 2025
ICML 2025 R2-FM Workshop Poster
License: CC BY 4.0
Keywords: Robustness
Abstract: Visual State Space Models (VSSMs) have recently emerged as a promising architecture, exhibiting remarkable performance in various computer vision tasks. However, their robustness has not yet been thoroughly studied. In this paper, we delve into the robustness of this architecture through comprehensive investigations from multiple perspectives. First, we assess its adversarial robustness using whole-image and patch-specific attacks, finding it superior to Transformers under whole-image attacks but vulnerable to patch-specific attacks. Second, we evaluate VSSMs' robustness across diverse scenarios, including natural adversarial examples, out-of-distribution (OOD) data, and common corruptions. VSSMs generalize well to OOD and corrupted data but struggle with natural adversarial examples. We also analyze their gradients under white-box attacks, revealing unique vulnerabilities and defenses. Lastly, we examine their sensitivity to image structure variations, identifying weaknesses tied to disturbance distribution and spatial information. Through these comprehensive studies, we contribute to a deeper understanding of VSSMs' robustness, providing valuable insights for refining and advancing the capabilities of deep neural networks in computer vision applications.
Submission Number: 1
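The whole-image attacks mentioned in the abstract are typically iterative gradient-based perturbations bounded in an L-infinity ball. As a rough illustration of that attack pattern (not the authors' actual evaluation setup, which targets full VSSM/Transformer vision models), the sketch below runs projected gradient descent on a toy logistic classifier, where the input gradient is available in closed form:

```python
import numpy as np

def pgd_linf(x, y, w, b, eps=0.1, alpha=0.02, steps=10):
    """Illustrative L-infinity PGD attack on a toy logistic model
    p = sigmoid(w . x + b). Real robustness evaluations apply the
    same loop to a deep network, obtaining the gradient via autodiff.
    All parameter names here are hypothetical, not from the paper."""
    x_adv = x.copy()
    for _ in range(steps):
        z = w @ x_adv + b
        p = 1.0 / (1.0 + np.exp(-z))           # sigmoid prediction
        grad = (p - y) * w                      # d(BCE loss)/dx in closed form
        x_adv = x_adv + alpha * np.sign(grad)   # gradient-sign ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep valid pixel range
    return x_adv
```

A patch-specific attack follows the same loop but applies the update only inside a small fixed spatial mask, which is the variant the abstract reports VSSMs being vulnerable to.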