EaNet: Enhanced Multimodal Awareness Alignment Network for Multimodal Aspect-Based Sentiment Analysis
Abstract: Multimodal Aspect-Based Sentiment Analysis (MABSA) aims to identify aspect-sentiment pairs from both text and images. Although notable progress has been made in aligning aspects with visual content, the implicit and subtle nature of language often leaves explicit aspect terms absent, making alignment challenging. Existing methods typically adopt a coarse strategy that aligns the entire image with an aspect, introducing noise from irrelevant or overlapping regions. Furthermore, different image regions may correspond to different textual aspects, causing their sentiment signals to interfere with one another. To tackle these issues, we propose the Enhanced Multimodal Awareness Alignment Network (EaNet), which enables fine-grained aspect-region alignment while mitigating cross-modal interference. EaNet first uses a modality-adaptive encoder to preserve intra-modal features and suppress irrelevant signals, then applies aspect-aware and sentiment-aware modules to jointly improve alignment and denoising. To further improve the model's understanding of multimodal sentiment patterns and aspect-opinion semantics, we design four targeted pre-training tasks. In particular, to address implicit aspect scenarios arising from concise textual expressions, we introduce a large language model-guided module for implicit aspect-opinion generation. Experiments on three MABSA subtasks show that EaNet achieves state-of-the-art performance.
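The abstract does not include implementation details, so the following is only a minimal sketch of what fine-grained aspect-region alignment could look like in general: each aspect representation attends over image-region features via cross-attention, with a gate to down-weight irrelevant regions. All class names, dimensions, and the gating choice here are hypothetical illustrations, not EaNet's actual architecture.

```python
import torch
import torch.nn as nn

class AspectRegionAlignment(nn.Module):
    """Hypothetical sketch: align each textual aspect with relevant image regions."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        # Cross-attention: aspects act as queries, image regions as keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Simple gate to suppress regions irrelevant to a given aspect (an assumed fusion choice).
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, aspect_feats, region_feats):
        # aspect_feats: (batch, num_aspects, dim), e.g. from a text encoder
        # region_feats: (batch, num_regions, dim), e.g. from a vision encoder
        attended, attn_weights = self.cross_attn(aspect_feats, region_feats, region_feats)
        g = torch.sigmoid(self.gate(torch.cat([aspect_feats, attended], dim=-1)))
        # Blend attended visual evidence with the original aspect representation.
        return g * attended + (1 - g) * aspect_feats, attn_weights

# Toy usage with random features.
model = AspectRegionAlignment()
aspects = torch.randn(2, 3, 768)   # 2 samples, 3 candidate aspects
regions = torch.randn(2, 49, 768)  # 2 samples, 49 image patches/regions
aligned, weights = model(aspects, regions)
print(aligned.shape, weights.shape)  # (2, 3, 768) and (2, 3, 49)
```

The per-aspect attention weights make the aspect-to-region assignment explicit, which is the contrast the abstract draws with coarse whole-image alignment; how EaNet actually realizes this is described in the paper itself.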
DOI: 10.1109/taffc.2025.3612991