Aspects are Anchors: Towards Multimodal Aspect-based Sentiment Analysis via Aspect-driven Alignment and Refinement
Abstract: Given coupled sentence-image pairs, Multimodal Aspect-based Sentiment Analysis (MABSA) aims to detect aspect terms and predict their sentiment polarity. While existing methods have made great efforts to align images and text for improved MABSA performance, they still struggle to mitigate the noisy correspondence problem (NCP), in which the text description is often poorly aligned with the visual content. To alleviate NCP, we introduce Aspect-driven Alignment and Refinement (ADAR), a two-stage coarse-to-fine alignment framework. In the first stage, ADAR devises a novel Coarse-to-fine Aspect-driven Alignment Module, which uses Optimal Transport (OT) to learn the coarse-grained alignment between visual and textual features. An adaptive filter bin is then applied to remove irrelevant image regions at a fine-grained level. In the second stage, ADAR introduces an Aspect-driven Refinement Module to further refine the cross-modality feature representation. Extensive experiments on two benchmark datasets demonstrate the superiority of our model over state-of-the-art methods on the MABSA task.
Primary Subject Area: [Engagement] Emotional and Social Signals
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: This work contributes to multimedia and multimodal processing by introducing a novel framework, Aspect-driven Alignment and Refinement (ADAR), designed specifically for Multimodal Aspect-based Sentiment Analysis (MABSA). The framework addresses the Noisy Correspondence Problem (NCP) by using Optimal Transport (OT) to align visual and textual features at a coarse-grained level and an Adaptive Filter Bin (AFB) to reduce noise at a fine-grained level. The Aspect-driven Refinement Module (ADRM) then enhances the cross-modality feature representation for improved sentiment prediction. These contributions are relevant to ACM Multimedia (ACM MM), a conference that fosters research and development in multimedia computing and its applications. The proposed ADAR model advances MABSA and offers a robust and efficient approach to handling complex multimodal data, a key area of interest for the ACM MM community.
Supplementary Material: zip
Submission Number: 2951
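Illustrative sketch (not from the submission): the abstract describes a coarse OT alignment between textual and visual features followed by a filter that drops irrelevant image regions. The Python below shows one common way such a pipeline could look, assuming entropy-regularised OT solved with Sinkhorn iterations, a cosine-distance cost, uniform marginals, and a simple mean-threshold rule standing in for the adaptive filter bin. All function names, feature shapes, and hyperparameters are hypothetical and are not taken from ADAR.

```python
import numpy as np

def cosine_cost(x, y):
    """Pairwise cosine distance between the rows of x and the rows of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T

def sinkhorn_plan(cost, a, b, eps=0.5, n_iters=300):
    """Entropy-regularised OT plan between histograms a (n,) and b (m,).

    eps and n_iters are illustrative values, not settings from the paper.
    """
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                  # rescale to match row marginals
        v = b / (K.T @ u)                # rescale to match column marginals
    return u[:, None] * K * v[None, :]   # transport plan, shape (n, m)

# Hypothetical inputs: 5 text-token features and 9 image-region features.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(5, 16))
image_feats = rng.normal(size=(9, 16))
a = np.full(5, 1.0 / 5)                  # uniform mass over tokens (assumption)
b = np.full(9, 1.0 / 9)                  # uniform mass over regions (assumption)

# Coarse alignment: how much mass each token sends to each region.
plan = sinkhorn_plan(cosine_cost(text_feats, image_feats), a, b)

# Stand-in for the fine-grained filter: keep regions whose strongest
# link to any token is at least the average over regions (assumption).
region_score = plan.max(axis=0)
keep = region_score >= region_score.mean()
filtered_regions = image_feats[keep]
```

The plan's rows sum approximately to `a` and its columns to `b`, so each entry can be read as a soft token-to-region correspondence. A learned or aspect-conditioned threshold would replace the fixed mean rule in any realistic filter.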