Dual Stream Alignment with Hierarchical Bottleneck Fusion For Multimodal Sentiment Analysis

ACL ARR 2024 June Submission 4266 Authors

16 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Multimodal sentiment analysis (MSA) leverages multiple modalities, such as text, image, and audio, for a comprehensive understanding of sentiment, but faces challenges such as temporal misalignment and modality heterogeneity. We propose a Dual-stream Alignment with Hierarchical Bottleneck Fusion (DAHB) method to address these issues. Our approach achieves comprehensive alignment through temporal alignment via cross-attention and semantic alignment via contrastive learning, ensuring alignment in both the time dimension and the feature space. Moreover, supervised contrastive learning is applied to refine these features. For modality fusion, we employ a hierarchical bottleneck method, progressively reducing the number of bottleneck tokens to compress information and using bi-directional cross-attention to learn interactions between modalities. We conduct experiments on MOSI, MOSEI, and CH-SIMS, and the results show that DAHB achieves state-of-the-art performance on a range of metrics. Ablation studies demonstrate the effectiveness of our methods. The code is available at url.
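The sketch below illustrates the hierarchical bottleneck fusion idea described in the abstract: a shrinking set of bottleneck tokens exchanges information with two modality streams via bi-directional cross-attention. Only two streams are shown for brevity, and all module names, dimensions, and the token-reduction schedule are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn


class BottleneckFusionLayer(nn.Module):
    """One fusion layer: bottleneck tokens compress each modality, then
    each modality reads the compressed context back (bi-directional)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        # Bottleneck tokens gather information from each modality ...
        self.bott_from_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.bott_from_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        # ... and each modality attends back to the bottleneck.
        self.text_from_bott = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_from_bott = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, audio, bottleneck):
        # Compress: bottleneck queries attend to modality tokens.
        b, _ = self.bott_from_text(bottleneck, text, text)
        b, _ = self.bott_from_audio(b, audio, audio)
        bottleneck = bottleneck + b
        # Broadcast: modality tokens attend back to the bottleneck.
        text = text + self.text_from_bott(text, bottleneck, bottleneck)[0]
        audio = audio + self.audio_from_bott(audio, bottleneck, bottleneck)[0]
        return text, audio, bottleneck


class HierarchicalBottleneckFusion(nn.Module):
    def __init__(self, dim: int = 128, num_tokens: int = 8, num_layers: int = 3):
        super().__init__()
        # Learnable bottleneck tokens shared across the batch.
        self.bottleneck = nn.Parameter(torch.randn(1, num_tokens, dim))
        self.layers = nn.ModuleList(
            [BottleneckFusionLayer(dim) for _ in range(num_layers)]
        )

    def forward(self, text, audio):
        # text: (B, T_text, dim), audio: (B, T_audio, dim)
        bott = self.bottleneck.expand(text.size(0), -1, -1)
        for layer in self.layers:
            text, audio, bott = layer(text, audio, bott)
            # Hierarchical compression: halve the bottleneck tokens per layer
            # (an assumed schedule for the "progressively reducing" step).
            bott = bott[:, : max(1, bott.size(1) // 2), :]
        # Pool the remaining bottleneck tokens into one fused representation.
        return bott.mean(dim=1)


if __name__ == "__main__":
    model = HierarchicalBottleneckFusion()
    fused = model(torch.randn(2, 20, 128), torch.randn(2, 50, 128))
    print(fused.shape)  # torch.Size([2, 128])
```

In this reading, the ever-smaller bottleneck acts as an information funnel between modalities, while the bi-directional attention lets each stream both contribute to and benefit from the compressed representation; how DAHB combines this with the temporal and semantic alignment streams is detailed in the paper itself.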
Paper Type: Long
Research Area: Computational Social Science and Cultural Analytics
Research Area Keywords: emotion detection and analysis
Contribution Types: NLP engineering experiment
Languages Studied: English, Chinese
Submission Number: 4266