Toward Compact and Structured Visual Representations in VLMs: SSM-Based Vision Encoders as an Alternative to Transformers
Keywords: Vision-language model, Vision state-space model
TL;DR: We show that SSM-based vision encoders produce more compact and spatially structured visual tokens than ViT-family alternatives, improving both grounding and general VQA in vision-language models without increasing token budgets.
Abstract: Spatial concept understanding is important for vision-language models (VLMs), not only for explicit grounding and localization tasks but also for general visual question answering. The common way to improve it has been to scale image resolution or visual token counts, yet this conflicts with the quadratic attention cost of the ViT-based encoders that nearly all current VLMs rely on. We ask an orthogonal question: can a vision encoder produce compact and structured visual token representations that carry richer spatial concept information under a fixed token budget? We conduct the first controlled evaluation of SSM-based vision encoders as frozen visual backbones in VLMs and find that VMamba-based VLMs achieve stronger spatial concept understanding than ViT-family alternatives at matched scale and token budget, with improvements that transfer from grounding benchmarks to general visual question answering. Token-region similarity maps computed at intermediate layers of the language model show that SSM visual tokens yield sharper and more spatially selective concept-region binding during LLM reasoning, indicating that the language model makes better use of SSM representations than of ViT-based representations under similar settings. These findings suggest that architectural inductive bias is an underexplored direction for improving visual concept representations in VLMs without increasing token budgets.
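As a rough illustration of what a token-region similarity map could look like, the sketch below scores each visual token's hidden state against the hidden state of one concept (text) token and reshapes the scores onto the image grid. The abstract does not specify how the maps are computed, so the use of cosine similarity, the function name `token_region_similarity`, the grid shape, and the random stand-in hidden states are all assumptions made for illustration, not the authors' procedure.

```python
import numpy as np

def token_region_similarity(concept_state, visual_states, grid_hw):
    """Cosine similarity between one concept-token hidden state and every
    visual-token hidden state, arranged on the visual token grid.

    concept_state: (d,) hidden state of a text token at some LLM layer (assumed).
    visual_states: (h*w, d) hidden states of the visual tokens at the same layer.
    grid_hw:       (h, w) layout of the visual tokens (assumed row-major).
    """
    h, w = grid_hw
    visual_states = np.asarray(visual_states, dtype=np.float64)
    concept_state = np.asarray(concept_state, dtype=np.float64)
    if visual_states.shape[0] != h * w:
        raise ValueError("number of visual tokens must equal h * w")

    eps = 1e-12  # guards against division by zero for all-zero vectors
    v = visual_states / (np.linalg.norm(visual_states, axis=1, keepdims=True) + eps)
    c = concept_state / (np.linalg.norm(concept_state) + eps)
    return (v @ c).reshape(h, w)

# Toy usage with random stand-ins for hidden states (not real model outputs).
rng = np.random.default_rng(0)
visual = rng.normal(size=(16, 8))   # 4x4 grid of visual tokens, hidden size 8
concept = visual[5].copy()          # concept token aligned with grid cell (1, 1)
sim_map = token_region_similarity(concept, visual, (4, 4))
peak = np.unravel_index(np.argmax(sim_map), sim_map.shape)
```

In this reading, a "sharper" map is one where the similarity mass concentrates on the grid cells covering the referenced region rather than spreading across the image; how sharpness is quantified in the paper is not stated in the abstract.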
Submission Number: 15