Exploring Fine-Grained Text-Image Alignment for Multimodal Aspect-based Sentiment Analysis

ACL ARR 2026 January Submission239 Authors

22 Dec 2025 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: ABSA, Multi-Modal Alignment, Image Segmentation
Abstract: A growing body of work on Multimodal Aspect-Based Sentiment Analysis (MABSA) has focused on text-image alignment. Yet these methods, whether attention-based, graph-based, or built on vision-language models, tend to rely on black-box modules that learn the alignment implicitly during training. Such implicit alignment cannot indicate which image regions correspond to important textual phrases, limiting its support for cross-modal interaction. Motivated by this, we explore building explicit fine-grained alignment. We propose Interpretation-based Explicit Alignment (IEA), a framework comprising a Sentimental Image Interpreter, which segments images and interprets region-level semantics, and a Fine-grained Aligner, which builds explicit fine-grained alignment between segmented regions and textual phrases. Extensive experiments on the Twitter2015 and Twitter2017 datasets establish new state-of-the-art performance, demonstrating the effectiveness of our alignment and pointing to a new direction for cross-modal interaction.
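The abstract describes, but does not specify, how the Fine-grained Aligner matches segmented regions to textual phrases. As an illustration only, the sketch below shows one common way such a region-phrase alignment matrix can be computed: temperature-scaled cosine similarity between region and phrase embeddings, normalized with a softmax over regions. The function name, the temperature value, and the softmax choice are all assumptions for exposition, not the authors' actual method.

    import numpy as np

    def align_regions_to_phrases(region_embs: np.ndarray,
                                 phrase_embs: np.ndarray,
                                 temperature: float = 0.07) -> np.ndarray:
        """Return a (num_phrases x num_regions) soft alignment matrix.

        Hypothetical sketch: each row is a distribution over image
        regions for one textual phrase, derived from cosine similarity.
        """
        # L2-normalize so the dot product equals cosine similarity.
        r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
        p = phrase_embs / np.linalg.norm(phrase_embs, axis=1, keepdims=True)
        sim = p @ r.T  # shape: (num_phrases, num_regions)
        # Temperature-scaled softmax over regions for each phrase.
        logits = sim / temperature
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        weights = np.exp(logits)
        return weights / weights.sum(axis=1, keepdims=True)

    # Toy usage: 4 segmented regions, 2 aspect phrases, 8-dim embeddings.
    rng = np.random.default_rng(0)
    regions = rng.normal(size=(4, 8))
    phrases = rng.normal(size=(2, 8))
    print(align_regions_to_phrases(regions, phrases).round(3))

An explicit matrix of this kind, unlike attention weights buried inside a black-box module, can be inspected directly to see which region each phrase is grounded in, which is the interpretability property the abstract argues for.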
Paper Type: Long
Research Area: Sentiment Analysis, Stylistic Analysis, and Argument Mining
Research Area Keywords: argument generation; argument mining
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data resources
Languages Studied: English
Submission Number: 239