Abstract: Audio-visual question answering (AVQA) aims to answer questions about visual objects, sounds, and their relationships within videos [6, 10, 11]. This task delves into the complexities of multimodal scenes, which are both diverse and dynamic. The key challenges of AVQA lie in accurately identifying the audio and video segments that directly relate to the question and in establishing whether the identified visual regions produce sounds relevant to the question. Prior research primarily leverages attention mechanisms to tackle these challenges: audio-guided visual attention localizes sounding visual regions, while question-guided temporal attention aggregates question-relevant audio and visual segments [4–6]. Nonetheless, audio and visual segments do not always correlate, and multimodal video content can vary dynamically over time. Furthermore, when faced with long sequences of audio-visual data accompanied by textual inputs, attention mechanisms may struggle to accurately discern cross-modal relationships over extended durations. Consequently, a selection mechanism driven by attention weights may falter on long sequences, inadvertently aggregating irrelevant segments despite their small weights.
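As a rough illustration of the attention-based pipeline described above, the following is a minimal sketch of audio-guided spatial attention followed by question-guided temporal pooling. It is not the cited works' implementation; the class and function names, feature shapes, and the simple dot-product scoring are assumptions made for clarity.

```python
# Hedged sketch: audio-guided attention over visual patches, then
# question-guided temporal pooling (illustrative, not the cited papers' code).
import torch
import torch.nn as nn


class AudioGuidedAttention(nn.Module):
    """Weights visual regions by their similarity to the concurrent audio feature."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # audio feature -> query
        self.k = nn.Linear(dim, dim)   # visual patches -> keys

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio:  (batch, T, dim)           one feature per segment
        # visual: (batch, T, patches, dim)  spatial grid per segment
        scores = torch.einsum('btd,btpd->btp', self.q(audio), self.k(visual))
        attn = scores.softmax(dim=-1)                        # which regions sound?
        return torch.einsum('btp,btpd->btd', attn, visual)   # pooled sounding region


def question_guided_temporal_pool(question: torch.Tensor, segments: torch.Tensor) -> torch.Tensor:
    # question: (batch, dim); segments: (batch, T, dim)
    w = torch.einsum('bd,btd->bt', question, segments).softmax(dim=-1)
    return torch.einsum('bt,btd->bd', w, segments)           # question-relevant summary
```

Because the softmax weights never reach exactly zero, even weakly related segments contribute to the pooled result, which is the failure mode on long sequences noted above.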
The Mamba model [1] has proven its strength in modeling long sequences across diverse tasks. It dynamically adjusts the parameters of State Space Models (SSMs) [2] based on the input, enabling context-aware reasoning. Mamba's capability to selectively retain information over arbitrarily long horizons inspired us to extend it to multimodal video modeling. In this paper, we introduce CM-Mamba, an extension that incorporates a cross-modality selection mechanism into Mamba models. The model is designed to efficiently exploit information across the audio, visual, and textual modalities, with the SSM parameters dynamically adjusted in response to inputs from another modality.
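To make the cross-modality selection idea concrete, here is a minimal sketch of a selective SSM layer whose input-dependent parameters (step size and projection matrices) are predicted from a second modality. This is an illustrative reading of the abstract under stated assumptions, not the authors' released implementation; the layer name, parameterization, and the naive step-by-step recurrence (in place of Mamba's fast parallel scan) are assumptions.

```python
# Hedged sketch of a cross-modality selective SSM: the recurrence runs over the
# primary stream (e.g., visual tokens) while delta, B, C come from the other
# modality (e.g., audio), so selection is conditioned cross-modally.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalSelectiveSSM(nn.Module):
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.d_state = d_state
        # Fixed, log-parameterized diagonal state matrix A (one diagonal per channel).
        self.A_log = nn.Parameter(
            torch.log(torch.arange(1, d_state + 1).float()).repeat(d_model, 1))
        # Input-dependent SSM parameters are predicted from the OTHER modality.
        self.to_delta = nn.Linear(d_model, d_model)
        self.to_B = nn.Linear(d_model, d_state)
        self.to_C = nn.Linear(d_model, d_state)
        self.D = nn.Parameter(torch.ones(d_model))   # residual skip term

    def forward(self, x: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        """x:   primary stream,      (batch, length, d_model)
           ctx: conditioning stream, (batch, length, d_model), time-aligned with x"""
        bsz, L, d_model = x.shape
        A = -torch.exp(self.A_log)                    # (d_model, d_state), stable
        delta = F.softplus(self.to_delta(ctx))        # (bsz, L, d_model)
        B_mat = self.to_B(ctx)                        # (bsz, L, d_state)
        C_mat = self.to_C(ctx)                        # (bsz, L, d_state)

        # Discretize and unroll the recurrence (clear but slower than a parallel scan).
        h = x.new_zeros(bsz, d_model, self.d_state)
        outputs = []
        for t in range(L):
            dA = torch.exp(delta[:, t].unsqueeze(-1) * A)                  # (bsz, d_model, d_state)
            dB = delta[:, t].unsqueeze(-1) * B_mat[:, t].unsqueeze(1)      # (bsz, d_model, d_state)
            h = dA * h + dB * x[:, t].unsqueeze(-1)
            y = (h * C_mat[:, t].unsqueeze(1)).sum(-1) + self.D * x[:, t]  # (bsz, d_model)
            outputs.append(y)
        return torch.stack(outputs, dim=1)            # (bsz, L, d_model)


if __name__ == "__main__":
    # Toy usage: visual segments gated by audio-derived SSM parameters.
    layer = CrossModalSelectiveSSM(d_model=64)
    vis = torch.randn(2, 50, 64)   # 50 visual segments
    aud = torch.randn(2, 50, 64)   # time-aligned audio features
    print(layer(vis, ctx=aud).shape)   # torch.Size([2, 50, 64])
```

The design choice illustrated here is that the hidden state updates for one modality are gated by another modality, so segments judged irrelevant by the conditioning stream can be suppressed rather than merely down-weighted.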