MoMa: Modulating Mamba for Adapting Image Foundation Models to Video Recognition

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose MoMa, an efficient adapter that integrates spatial-temporal modeling into pre-trained image foundation models for improved video understanding, achieving better performance with lower computational cost.
Abstract: Video understanding is a complex challenge that requires effective modeling of spatial-temporal dynamics. With the success of image foundation models (IFMs) in image understanding, recent approaches have explored parameter-efficient fine-tuning (PEFT) to adapt IFMs for video. However, most of these methods process spatial and temporal information separately, which may fail to capture the full intricacy of video dynamics. In this paper, we propose MoMa, an efficient adapter framework that achieves full spatial-temporal modeling by integrating Mamba's selective state space modeling into IFMs. We propose a novel SeqMod operation to inject spatial-temporal information into pre-trained IFMs without disrupting their original features. By incorporating SeqMod into a Divide-and-Modulate architecture, MoMa enhances video understanding while maintaining computational efficiency. Extensive experiments on multiple video benchmarks demonstrate the effectiveness of MoMa, achieving superior performance with reduced computational cost. Code will be released upon publication.
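To make the modulation idea concrete, below is a minimal PyTorch sketch of a SeqMod-style adapter, not the authors' implementation. All names (`SeqModAdapter`, `to_scale`, `to_shift`) are hypothetical, and a GRU stands in for Mamba's selective state space model so the sketch runs on CPU-only PyTorch. The key property it illustrates is modulation rather than replacement: the adapter predicts per-token scale and shift terms from the flattened spatial-temporal token sequence and applies them residually, with zero-initialized heads so the pre-trained IFM features pass through unchanged at the start of training.

```python
# A minimal sketch of a SeqMod-style modulation adapter (names hypothetical).
# The paper's SeqMod uses Mamba's selective SSM; a GRU stands in here so the
# example is self-contained and runnable without CUDA-only dependencies.
import torch
import torch.nn as nn

class SeqModAdapter(nn.Module):
    """Modulates frozen IFM tokens with spatial-temporal context.

    Instead of overwriting pre-trained features, the adapter predicts
    per-token scale/shift terms from a sequence model run over the
    flattened spatial-temporal token sequence and applies them residually.
    """
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)                    # bottleneck for efficiency
        self.seq = nn.GRU(hidden, hidden, batch_first=True)   # stand-in for Mamba's SSM
        self.to_scale = nn.Linear(hidden, dim)
        self.to_shift = nn.Linear(hidden, dim)
        # Zero-init the heads so the adapter is an identity map at start,
        # preserving the IFM's original features.
        nn.init.zeros_(self.to_scale.weight); nn.init.zeros_(self.to_scale.bias)
        nn.init.zeros_(self.to_shift.weight); nn.init.zeros_(self.to_shift.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T*N, dim) -- T frames and N patches flattened into one sequence
        h, _ = self.seq(self.down(x))
        scale, shift = self.to_scale(h), self.to_shift(h)
        return x * (1 + scale) + shift

# Usage: tokens from a frozen ViT backbone over T frames with N patches each.
B, T, N, D = 2, 8, 196, 768
tokens = torch.randn(B, T * N, D)
out = SeqModAdapter(D)(tokens)   # (2, 1568, 768), identical to input at init
```

Swapping the GRU for an actual selective SSM block (e.g., from the `mamba_ssm` package) would recover the linear-time sequence modeling the paper relies on; the modulation structure around it stays the same.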
Lay Summary: Today’s AI models struggle to fully understand videos. While they excel at analyzing static images, they often treat what’s happening (like a ball flying) separately from where it’s happening (like a basketball court). This limits their ability to grasp complex actions, such as a gymnast’s fluid routine. We developed MoMa, an add-on tool that upgrades image-based AI to understand motion naturally. Inspired by how humans track movement, MoMa uses SeqMod—a technique that weaves together changes over time (temporal) and scene details (spatial) without overwriting the AI’s original knowledge. Its efficient "Divide-and-Modulate" design avoids heavy computations. MoMa outperformed existing methods on sports, surveillance, and action-recognition tasks while using 40% less computing power.
Primary Area: Applications->Computer Vision
Keywords: Video Recognition, PEFT, Mamba, Adapter
Submission Number: 202