Unleashing the Power of Selective State Space Models in Vision-Language Models

25 Sept 2024 (modified: 13 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · Readers: Everyone · License: CC BY 4.0
Keywords: Vision-Language Models; Mamba
Abstract: While emerging multi-modal large language models (MLLMs) have demonstrated impressive advances, the quadratic complexity of their Transformer-based LLMs (3B parameters or larger) inevitably incurs considerable computational overhead. On the other hand, the recently proposed selective state space model (i.e., Mamba) offers both model capacity and computational efficiency, making it an ideal component for enhancing MLLMs' efficiency and performance. However, recent attempts to introduce Mamba into MLLMs simply replace their LLMs with Mamba, ignoring the unique characteristics of either side. We argue that such a naive combination cannot unlock Mamba's potential in MLLMs. In this paper, we delve into harnessing Mamba's unique properties and propose tailored designs from both the multi-modal input and the architectural perspective to unleash its true power. First, we fully exploit Mamba's linear complexity to construct long visual sequences for thorough perception at only a minor efficiency cost. To integrate the scanning mechanism with these long visual sequences, we devise a novel cross-stitch scanning approach that captures and fuses spatial and semantic properties simultaneously, enhancing the interaction of visual information and the vision-language alignment. Built upon these designs, we propose MambaVLM, a simple yet effective MLLM framework that achieves highly competitive results across multiple benchmarks. Moreover, our framework is also compatible with Transformer-based LLMs (e.g., Vicuna), demonstrating remarkable training and inference efficiency. Notably, with only 0.66M training samples and 14 hours of training on a single A800 node, our MambaVLM outperforms LLaVA-1.5 by significant margins and performs on par with or even better than Qwen-VL, which was trained on 1.4B samples. These appealing results in terms of both effectiveness and efficiency indicate the promising prospects of Mamba in MLLMs.
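The abstract does not specify how the cross-stitch scan is constructed, so the following is only a minimal illustrative sketch of one plausible reading: interleaving a spatial (raster) patch ordering with a "semantic" ordering token-by-token, producing a single long sequence that a Mamba-style linear-time scan can consume. The semantic criterion here (sorting patches by feature norm) is a hypothetical stand-in, not the paper's actual design.

```python
import numpy as np

def raster_order(h, w):
    # Spatial scan: row-major order over an h x w patch grid.
    return [r * w + c for r in range(h) for c in range(w)]

def semantic_order(feats):
    # Hypothetical "semantic" scan: sort patches by feature norm.
    # (A stand-in only; the paper's criterion is not given in the abstract.)
    norms = np.linalg.norm(feats, axis=1)
    return [int(i) for i in np.argsort(-norms)]

def cross_stitch(order_a, order_b):
    # Interleave the two scan orders token-by-token ("cross-stitch"),
    # so spatial and semantic neighbors alternate in one long sequence.
    stitched = []
    for a, b in zip(order_a, order_b):
        stitched.extend([a, b])
    return stitched

h, w, d = 2, 3, 4
feats = np.random.default_rng(0).normal(size=(h * w, d))
seq = cross_stitch(raster_order(h, w), semantic_order(feats))
# The stitched sequence is twice as long, but a selective-scan (Mamba)
# pass over it remains O(n), unlike quadratic self-attention.
print(len(seq))
```

Each patch index appears twice in the stitched sequence (once per ordering), which is one way the fused sequence could expose both spatial and semantic context to a single recurrent scan at linear cost.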
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4294