R3DG: Retrieve, Rank, and Reconstruction with Different Granularities for Multimodal Sentiment Analysis

Yan Zhuang, Yanru Zhang, Jiawen Deng, Fuji Ren

Published: 01 Jan 2025 · Last Modified: 23 Nov 2025 · License: CC BY-SA 4.0
Abstract: Multimodal sentiment analysis (MSA) aims to assess emotional states by integrating information from text, audio, and video. However, the heterogeneous nature of these modalities presents substantial challenges for accurate sentiment prediction. Existing approaches typically align pairs of modalities using attention mechanisms or contrastive learning, which are computationally expensive. Additionally, they often rely on a single alignment granularity, either averaging features over all time steps or aligning features at each individual time step. Such approaches overlook the fact that emotional expression varies across individuals and contexts, so multiple granularities are needed to capture emotion effectively. To address these challenges, we propose a novel framework, Retrieve, Rank, and Reconstruction with Different Granularities (R3DG). R3DG segments the audio and video modalities into multiple representations at varying granularities based on their temporal durations. It then selects the representations that align most closely with the text modality. To preserve the original information, R3DG reconstructs the audio and video data from the selected representations. Finally, the fused audio, video, and text features are aligned and combined for sentiment prediction, reducing the need for multiple alignment steps. Extensive experiments on five benchmark MSA datasets demonstrate that R3DG outperforms existing methods and achieves substantial reductions in computational time. Code is available at https://github.com/YetZzzzzz/R3DG.
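To make the retrieve-rank-reconstruct pipeline concrete, the following is a minimal sketch of the idea as described in the abstract: pool audio/video features at several temporal granularities, rank the resulting segments by similarity to a pooled text representation, keep the top-ranked segments, and reconstruct the full sequence from them. All function names, window sizes, the top-k selection, and the reconstruction loss are illustrative assumptions, not the authors' implementation; see the repository linked above for the actual code.

# Hedged sketch of the R3DG idea (assumed design, not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_granularity_pool(x, window_sizes=(1, 4, 8)):
    """Average-pool a (B, T, D) sequence with several window sizes,
    yielding segment representations at different temporal granularities."""
    B, T, D = x.shape
    segments = []
    for w in window_sizes:
        pad = (-T) % w                      # pad so T is divisible by w
        xp = F.pad(x, (0, 0, 0, pad))       # pad along the time dimension
        seg = xp.view(B, -1, w, D).mean(dim=2)  # (B, T/w, D) pooled windows
        segments.append(seg)
    return torch.cat(segments, dim=1)       # (B, S, D) segments, all scales

def retrieve_and_rank(segments, text_repr, k=8):
    """Score each segment against the pooled text representation and
    keep the k segments that align most closely with the text."""
    scores = F.cosine_similarity(segments, text_repr.unsqueeze(1), dim=-1)  # (B, S)
    topk = scores.topk(k, dim=1).indices                                    # (B, k)
    idx = topk.unsqueeze(-1).expand(-1, -1, segments.size(-1))
    return torch.gather(segments, 1, idx)                                   # (B, k, D)

class Reconstructor(nn.Module):
    """Toy decoder mapping selected segments back to a full-length sequence,
    so a reconstruction loss can encourage preserving the original information."""
    def __init__(self, dim, out_len):
        super().__init__()
        self.out_len = out_len
        self.proj = nn.Linear(dim, dim)

    def forward(self, selected):
        h = self.proj(selected).transpose(1, 2)            # (B, D, k)
        h = F.interpolate(h, size=self.out_len, mode="linear")
        return h.transpose(1, 2)                           # (B, T, D)

if __name__ == "__main__":
    B, T, D = 2, 32, 64
    audio = torch.randn(B, T, D)     # stand-in for extracted audio features
    text = torch.randn(B, D)         # stand-in for pooled text features
    segs = multi_granularity_pool(audio)
    selected = retrieve_and_rank(segs, text, k=8)
    recon = Reconstructor(D, T)(selected)
    loss = F.mse_loss(recon, audio)  # illustrative reconstruction objective
    print(selected.shape, recon.shape, loss.item())

In this sketch the same routine would be applied to the video stream, and the reconstructed audio/video features would then be fused with the text features for the final sentiment prediction; those fusion and prediction heads are omitted here.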