Abstract: Despite the remarkable success of Transformer-based self-attention across many domains, its effectiveness often diminishes in complex multimodal scenarios, where varying token granularities and long, noisy inputs can overwhelm the model. In this paper, we introduce the Soft Token Attention Masking Process (STAMP), a novel soft-masking mechanism designed to prioritize the most relevant tokens across visual, audio, and textual streams. By refining attention maps globally, STAMP modulates each token's contribution according to its contextual importance, preserving critical temporal and intermodal cues without discarding information outright. We integrate STAMP into a multi-layer Transformer pipeline and thoroughly evaluate it on challenging video understanding benchmarks such as MADv2 and QVHighlights. Experimental results show that STAMP delivers significant performance gains and offers a robust solution for complex multimodal tasks.
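The abstract does not give STAMP's exact formulation, so the following is only a minimal PyTorch sketch of one plausible soft token masking scheme matching its description: a learned per-token relevance score (the `scorer` module is a hypothetical stand-in, not the paper's architecture) rescales attention logits so low-relevance tokens are down-weighted rather than hard-dropped.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftTokenMask(nn.Module):
    """Illustrative soft token masking (hypothetical sketch, not the
    paper's exact STAMP formulation): each key token gets a relevance
    score in (0, 1) that rescales its attention logits, down-weighting
    low-scoring tokens instead of discarding them."""

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # assumed per-token importance scorer

    def forward(self, q, k, v):
        # q, k, v: (batch, seq, dim)
        d = q.size(-1)
        logits = q @ k.transpose(-2, -1) / d ** 0.5   # (batch, seq, seq)
        # Soft mask in (0, 1) for every key token.
        mask = torch.sigmoid(self.scorer(k)).squeeze(-1)  # (batch, seq)
        # Adding the log-mask to the logits is equivalent to multiplying
        # the attention weights by `mask` before renormalization.
        logits = logits + torch.log(mask + 1e-9).unsqueeze(1)
        attn = F.softmax(logits, dim=-1)
        return attn @ v

# Usage on random tensors:
layer = SoftTokenMask(dim=64)
x = torch.randn(2, 10, 64)
out = layer(x, x, x)  # (2, 10, 64)
```

The additive log-mask keeps the operation fully differentiable, and as a token's score approaches zero it recovers hard masking as a limiting case, which is one way to realize the "without discarding valuable information" behavior the abstract claims.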
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, video processing, cross-modal application
Contribution Types: Model analysis & interpretability, Reproduction study, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 3367