Keywords: Separable Self-attention, State Space Models, Vision Mamba
TL;DR: Inspired by the state space model at the core of Vision Mamba, we propose a novel separable self-attention mechanism termed Vision Mamba-Inspired Separable Self-Attention (VMI-SA).
Abstract: Separable self-attention is an early attention mechanism with linear complexity. At comparable parameters and FLOPs, lightweight networks built upon separable self-attention and its variants underperform the recent Vision Mamba (ViM). By analyzing the strengths and weaknesses of separable self-attention, we distill four design principles and, inspired by the State Space Model (SSM) at the core of ViM, propose a novel separable self-attention termed Vision Mamba-Inspired Separable Self-Attention (VMI-SA). Notably, VMI-SA does not incorporate any SSM blocks, and, to the best of our knowledge, its attention computation differs from all existing attention mechanisms. We introduce proof-of-concept networks, VMINet and VMIFormer, enabling fair comparisons with ViMs through deliberate control of parameters, FLOPs, and the number of encoders. Compared to state-of-the-art Transformers, CNNs, and ViMs, VMINet and VMIFormer achieve competitive results in image classification and high-resolution dense prediction tasks.
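For context on the baseline the abstract starts from: separable self-attention (introduced in MobileViT-v2) achieves linear complexity by replacing the O(N²) token-to-token attention matrix with scalar context scores that pool the tokens into a single context vector. Below is a minimal PyTorch sketch of that baseline mechanism, not of the VMI-SA proposed here (which the abstract states differs from all existing attention mechanisms); the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class SeparableSelfAttention(nn.Module):
    """Minimal sketch of linear-complexity separable self-attention
    (MobileViT-v2 style): a single learned latent query maps each token
    to a scalar score; the scores pool the keys into one global context
    vector, which then modulates the values elementwise."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_scores = nn.Linear(dim, 1)  # latent query: d -> 1
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        scores = self.to_scores(x).softmax(dim=1)       # (B, N, 1)
        context = (scores * self.to_key(x)).sum(dim=1)  # (B, d) global context
        values = torch.relu(self.to_value(x))           # (B, N, d)
        out = values * context.unsqueeze(1)             # broadcast context to every token
        return self.proj(out)
```

Every step is a projection or an N-term weighted sum, so cost scales as O(N·d) rather than O(N²·d), which is the property the paper's design principles build on.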
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6579