Toward Compact and Structured Visual Representations in VLMs: SSM-Based Vision Encoders as an Alternative to Transformers

Shang-Jui Ray Kuo; Paola Cascante-Bonilla

Published: 24 Apr 2026, Last Modified: 01 Jun 2026, Venue: VisCon 2026 Poster, License: CC BY 4.0

Keywords: Vision-language model, Vision state-space model

TL;DR: We show that SSM-based vision encoders produce more compact and spatially structured visual tokens than ViT-family alternatives, improving both grounding and general VQA in vision-language models without increasing token budgets.

Abstract: Spatial concept understanding is important for vision-language models (VLMs), not only for explicit grounding and localization tasks but also for general visual question answering. The common approach has been to scale image resolution or visual token counts, but this conflicts with the quadratic attention cost of the ViT-based encoders that nearly all current VLMs rely on. We ask an orthogonal question: can a vision encoder produce compact and structured visual token representations that carry richer spatial concept information under a fixed token budget? We conduct the first controlled evaluation of SSM-based vision encoders as frozen visual backbones in VLMs and find that VMamba-based VLMs achieve stronger spatial concept understanding than ViT-family alternatives at matched scale and token budget, with improvements that transfer from grounding benchmarks to general visual question answering. Token-region similarity maps computed at intermediate layers of the language model show that SSM visual tokens produce sharper and more spatially selective concept-region binding during LLM reasoning, indicating that the language model makes better use of SSM visual representations than of ViT-based ones under similar settings. These findings suggest that architectural inductive bias is an underexplored direction for improving visual concept representations in VLMs without increasing token budgets.
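
As a rough illustration of the token-region similarity maps mentioned in the abstract, the sketch below scores each visual token against one concept (text) token and lays the scores out on the image grid. The abstract does not specify the similarity measure or the layer used, so cosine similarity, the function names `cosine` and `token_region_similarity`, and the toy 2x2 grid are all assumptions made for illustration, not the authors' procedure.

```python
import math


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def token_region_similarity(concept_state, visual_states, grid_hw):
    """Return an h x w map of similarities between one concept-token hidden
    state and every visual-token hidden state from the same LLM layer.

    visual_states is assumed to be in row-major order over the image grid.
    """
    h, w = grid_hw
    if len(visual_states) != h * w:
        raise ValueError("expected h * w visual tokens")
    sims = [cosine(concept_state, v) for v in visual_states]
    # Reshape the flat list of scores into grid rows.
    return [sims[r * w:(r + 1) * w] for r in range(h)]


# Toy example: 2-dimensional hidden states on a hypothetical 2x2 token grid.
concept = [1.0, 0.0]
visual = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
sim_map = token_region_similarity(concept, visual, (2, 2))
for row in sim_map:
    print([round(s, 3) for s in row])
```

Under this reading, a "sharper" map is one where high scores concentrate on the grid cells covering the referenced region rather than spreading across the whole image.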

Submission Number: 15
