MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Spotlight Poster · CC BY 4.0
TL;DR: A modular attention-empowered multimodal LLM that delves deep into fine-grained cues for emotion understanding and cognition analysis.
Abstract: Multimodal large language models (MLLMs) have recently shown strong capacity for integrating data across multiple modalities, empowered by a generalizable attention architecture. Advanced methods predominantly focus on language-centric tuning while leaving the way multimodal tokens are mixed through attention less explored, posing challenges for high-level tasks that require fine-grained cognition and emotion understanding. In this work, we identify the attention deficit disorder problem in multimodal learning, caused by inconsistent cross-modal attention and layer-by-layer decaying attention activation. To address this, we propose a novel attention mechanism, termed MOdular Duplex Attention (MODA), which simultaneously conducts inner-modal refinement and inter-modal interaction. MODA employs a correct-after-align strategy to effectively decouple modality alignment from cross-layer token mixing. In the alignment phase, tokens are mapped to duplex modality spaces based on basis vectors, enabling interaction between the visual and language modalities. The correctness of attention scores is then ensured through adaptive masked attention, which enhances the model's flexibility by allowing customizable masking patterns for different modalities. Extensive experiments on 21 benchmark datasets verify the effectiveness of MODA in perception, cognition, and emotion tasks.
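To make the correct-after-align idea from the abstract more concrete, below is a minimal PyTorch sketch, not the authors' implementation. It assumes a two-modality setting (vision/language), and the specific choices (learnable per-modality basis vectors `num_bases`, a `modality_ids` tensor marking each token's modality, and a learnable modality-pair bias standing in for the paper's adaptive masking scheme) are illustrative assumptions rather than details given in the paper.

```python
import torch
import torch.nn as nn


class DuplexAlign(nn.Module):
    """Align phase (sketch): project tokens onto learnable basis vectors
    of two modality spaces before attention mixes them."""

    def __init__(self, dim, num_bases=16):
        super().__init__()
        # one set of basis vectors per modality (0 = vision, 1 = language); assumed learnable
        self.bases = nn.Parameter(torch.randn(2, num_bases, dim))

    def forward(self, x, modality_ids):
        # x: (B, N, D); modality_ids: (B, N) with entries in {0, 1}
        basis = self.bases[modality_ids]                         # (B, N, K, D)
        coeff = torch.einsum('bnd,bnkd->bnk', x, basis)          # coordinates on the bases
        aligned = torch.einsum('bnk,bnkd->bnd', coeff.softmax(-1), basis)
        return x + aligned                                       # residual alignment


class AdaptiveMaskedAttention(nn.Module):
    """Correct phase (sketch): self-attention with a customizable
    per-modality additive mask."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # learnable bias controlling how strongly each modality pair attends (assumption)
        self.pair_bias = nn.Parameter(torch.zeros(2, 2))

    def forward(self, x, modality_ids):
        # build an additive (B, N, N) mask from the modality-pair bias
        mask = self.pair_bias[modality_ids.unsqueeze(2), modality_ids.unsqueeze(1)]
        mask = mask.repeat_interleave(self.attn.num_heads, dim=0)  # (B*heads, N, N)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out


# Usage sketch: align first, then apply the adaptively masked attention.
x = torch.randn(2, 10, 64)                    # mixed visual + language tokens
mids = torch.randint(0, 2, (2, 10))           # per-token modality labels
x = DuplexAlign(64)(x, mids)
x = AdaptiveMaskedAttention(64)(x, mids)
```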
Lay Summary: Imagine trying to understand a scene where both text and images are involved — like reading a description of a movie while watching a scene. Current computer models, called multimodal large language models (MLLMs), can process both types of information. However, they often struggle to pay equal attention to the text and the images, causing them to miss important details. We identified a key issue in how these models mix the information from different sources, which affects their ability to understand complex tasks like recognizing emotions or making sense of complicated scenarios. To fix this, we introduced a new attention mechanism called MODA. It helps the model better focus on both the text and the images by improving how they interact and align with each other. Think of it like fine-tuning the way you read and watch at the same time to better understand the full picture. Our new approach has been tested on a variety of tasks, from recognizing emotions to understanding images and text, and it works better than previous methods. This improvement can help create smarter AI systems that can understand the world more like humans do, across different types of media.
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: emotion, attention, multimodal large language model, large language model
Submission Number: 10155