DiGTF: A Difference-Guided Two-Stage Fusion Framework for Multimodal Sentiment Analysis

Hui Liu, Minghua Nuo, Rui Li, Chengyi Zhou

Published: 2025, Last Modified: 24 Apr 2026 | NLPCC (3) 2025 | CC BY-SA 4.0

Abstract: Multimodal Sentiment Analysis (MSA) aims to understand human emotions by combining information from different modalities. Despite recent advances, existing methods often struggle to effectively leverage task-relevant signals from non-verbal modalities, particularly in scenarios where textual semantics are ambiguous or incomplete. To address this issue, we propose DiGTF, a Difference-Guided Two-Stage Fusion Framework that enhances textual representations by integrating refined audio and video features. First, we introduce Disentangled Irrelevance Removal (DIR), which employs a dual cross-attention mechanism to disentangle audio-video representations into modality-invariant, sentiment-relevant, and task-irrelevant components, preserving only the sentiment-relevant features. Then, we design a two-stage fusion strategy to enhance semantic representations. The first stage, Difference-Guided Fusion (DGF), adaptively incorporates cross-modal differences that align with sentiment cues into the textual features. The second stage, Multi-View Fusion (MVF), leverages a cross-scale attention mechanism to integrate the diverse fused representations and capture complex emotional patterns. Extensive experiments on three benchmark MSA datasets (CMU-MOSI, CMU-MOSEI, and CH-SIMS) demonstrate that DiGTF achieves outstanding performance.
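
The abstract does not give the equations behind Difference-Guided Fusion, so the following is only a rough illustration of the general idea of adding a gated cross-modal difference to a text feature vector. It is not the authors' formulation. The function name `difference_guided_fuse`, the scalar sigmoid gate, and the parameters `gate_weights` and `gate_bias` are all assumptions introduced here for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def difference_guided_fuse(text, other, gate_weights, gate_bias=0.0):
    """Hypothetical gated fusion (an assumption, not the paper's DGF).

    Computes the element-wise difference between a non-verbal feature
    vector and the text feature vector, derives one scalar gate from
    that difference, and adds the gated difference back to the text.
    """
    if not (len(text) == len(other) == len(gate_weights)):
        raise ValueError("all vectors must have the same length")

    # Cross-modal difference: what the other modality says that text does not.
    diff = [o - t for t, o in zip(text, other)]

    # Scalar gate in (0, 1); in a trained model the weights would be learned.
    gate = sigmoid(sum(w * d for w, d in zip(gate_weights, diff)) + gate_bias)

    # Text features shifted toward the other modality by the gated difference.
    fused = [t + gate * d for t, d in zip(text, diff)]
    return fused, gate


if __name__ == "__main__":
    text_feat = [1.0, 2.0]
    audio_feat = [3.0, 0.0]
    fused, gate = difference_guided_fuse(text_feat, audio_feat, [0.0, 0.0])
    print(gate, fused)
```

With zero gate weights and zero bias, the gate is 0.5 and the fused vector is the midpoint of the two inputs. A strongly negative bias drives the gate toward 0, leaving the text features nearly unchanged. The paper's actual mechanism may well use vector-valued gates or attention.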