Abstract: State Space Models (SSMs) provide linear-time alternatives to attention for vision, but require serializing 2D images into 1D sequences using a predefined scan order. We identify scan order as a previously underexplored inductive bias that fundamentally shapes spatial dependency modeling in Vision SSMs. Fixed scan paths distort local adjacency, fragment object structure, and induce anisotropic representations that are brittle under geometric transformations such as rotation. We propose Partial RIng Scan Mamba (PRIS-Mamba), a rotation-robust traversal that decomposes images into concentric rings, performs permutation-invariant aggregation within each ring, and models cross-ring dependencies via short radial SSMs. This design induces a structured factorization of spatial dependencies that preserves isotropy while maintaining linear complexity. To improve efficiency without sacrificing expressivity, we introduce partial channel filtering, selectively applying recurrent modeling to informative channels while routing others through a residual pathway. Empirically, PRIS-Mamba improves accuracy, efficiency, and rotation robustness over prior Vision SSMs on ImageNet-1K. Our results position scan-order design as a core representational choice in Vision SSMs, with implications for robustness and generalization beyond architectural scaling. The code will be released upon paper acceptance.
Lay Summary: Many vision models process an image by breaking it into small pieces and reading them in a fixed order, such as from left to right. Although this may seem like a small design choice, the reading order can affect how well the model understands objects, especially when an image is rotated or when object parts are separated along the reading path. This paper proposes a new way to read images using concentric rings. Instead of following a fixed straight-line path, the model gathers information ring by ring, from the center outward. This helps preserve the geometric structure of the image and makes the model more stable under rotation. We also reduce computation by applying heavier processing only to the most useful parts of the image representation. Experiments show that our method improves image classification, object detection, and segmentation while using less computation than several existing vision models. Overall, this work shows that the order in which a model reads an image is an important factor for building efficient and robust visual recognition systems.
Originally Submitted Supplementary Material: zip
Primary Area: Applications->Computer Vision
Keywords: Vision State Space Models, Mamba, Vision Mamba, Scan order, Ring scan, Partial channel filtering
Originally Submitted PDF: pdf
Submission Number: 31531
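The ring traversal described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (the code is stated to be released upon acceptance), and every name in it is hypothetical. It assumes a square feature grid, rings defined by Chebyshev distance from the grid centre, mean pooling as the permutation-invariant aggregation, a toy scalar linear recurrence in place of a learned radial SSM, and a fixed index list in place of a learned choice of informative channels.

```python
import numpy as np

def ring_index(h, w):
    """Assign each grid cell to a concentric square ring (0 = innermost)."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    # Assumed ring definition: Chebyshev distance from the grid centre.
    d = np.maximum(np.abs(ys - cy), np.abs(xs - cx))
    return np.floor(d).astype(int)

def ring_means(feat):
    """Mean-pool features within each ring: (H, W, C) -> (R, C).

    The mean ignores the order of cells inside a ring, so it is
    permutation-invariant within each ring.
    """
    h, w, _ = feat.shape
    idx = ring_index(h, w)
    n_rings = idx.max() + 1
    return np.stack([feat[idx == r].mean(axis=0) for r in range(n_rings)])

def radial_scan(ring_feats, decay=0.9, gain=1.0):
    """Toy linear recurrence over rings, from the centre outward.

    Stand-in for a learned radial SSM: state_k = decay * state_{k-1} + gain * x_k.
    """
    state = np.zeros(ring_feats.shape[1])
    out = []
    for x in ring_feats:
        state = decay * state + gain * x
        out.append(state.copy())
    return np.stack(out)

def partial_radial_scan(ring_feats, active, decay=0.9):
    """Apply the recurrence only to the `active` channels (hypothetical selection).

    All other channels pass through unchanged, mimicking a residual pathway.
    """
    out = ring_feats.copy()
    out[:, active] = radial_scan(ring_feats[:, active], decay=decay)
    return out

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 4))      # hypothetical 8x8 grid, 4 channels
rings = ring_means(feat)                   # shape (4, 4): four rings, four channels
rotated = ring_means(np.rot90(feat))       # 90-degree rotation of the grid
mixed = partial_radial_scan(rings, active=[0, 1])
```

Under these assumptions, a 90-degree rotation of a square grid only permutes cells within each ring, so `rings` and `rotated` agree up to floating-point error. Arbitrary rotation angles would need a different ring definition and interpolation, which this sketch does not attempt.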