Abstract: Despite the remarkable success of Transformer-based self-attention across many domains, its effectiveness often diminishes in complex multimodal scenarios, where varying token granularities and long, noisy inputs can overwhelm the model. In this paper, we introduce the Soft Token Attention Masking Process (STAMP), a novel soft-masking mechanism designed to prioritize the most relevant tokens across visual, audio, and textual streams. By refining attention maps globally, STAMP modulates each token's contribution according to its contextual importance, preserving critical temporal and intermodal cues without discarding information outright. We integrate STAMP into a multi-layer Transformer pipeline and thoroughly evaluate it on challenging video understanding benchmarks such as MADv2 and QVHighlights. Experimental results show that STAMP delivers significant performance gains and offers a robust solution for complex multimodal tasks.
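The abstract does not give STAMP's exact formulation, so the following is only a minimal PyTorch sketch of one plausible soft token masking scheme matching its description: a learned per-token relevance score (the `scorer` module is a hypothetical stand-in, not the paper's architecture) rescales attention logits so low-relevance tokens are down-weighted rather than hard-dropped.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftTokenMask(nn.Module):
    """Illustrative soft token masking (hypothetical sketch, not the
    paper's exact STAMP formulation): each key token gets a relevance
    score in (0, 1) that rescales its attention logits, down-weighting
    low-scoring tokens instead of discarding them."""

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # assumed per-token importance scorer

    def forward(self, q, k, v):
        # q, k, v: (batch, seq, dim)
        d = q.size(-1)
        logits = q @ k.transpose(-2, -1) / d ** 0.5   # (batch, seq, seq)
        # Soft mask in (0, 1) for every key token.
        mask = torch.sigmoid(self.scorer(k)).squeeze(-1)  # (batch, seq)
        # Adding the log-mask to the logits is equivalent to multiplying
        # the attention weights by `mask` before renormalization.
        logits = logits + torch.log(mask + 1e-9).unsqueeze(1)
        attn = F.softmax(logits, dim=-1)
        return attn @ v

# Usage on random tensors:
layer = SoftTokenMask(dim=64)
x = torch.randn(2, 10, 64)
out = layer(x, x, x)  # (2, 10, 64)
```

The additive log-mask keeps the operation fully differentiable, and as a token's score approaches zero it recovers hard masking as a limiting case, which is one way to realize the "without discarding valuable information" behavior the abstract claims.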
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, video processing, cross-modal application
Contribution Types: Model analysis & interpretability, Reproduction study, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 3367