Adaptive Token Selection and Fusion Network for Multimodal Sentiment Analysis

Published: 01 Jan 2024 · Last Modified: 19 Jul 2025 · MMM (3) 2024 · License: CC BY-SA 4.0
Abstract: Multimodal sentiment analysis aims to predict human sentiment polarity from multiple modalities. Most existing methods integrate original modal features directly into multimodal fusion, ignoring the redundancy and heterogeneity across modalities. In this paper, we propose a simple yet efficient Adaptive Token Selection and Fusion Network (ATSFN) to mitigate the effects of redundancy and heterogeneity. ATSFN employs adaptive trainable tokens to extract informative unimodal tokens and to perform dynamic multimodal token fusion. Specifically, we first distill critical information from the original features into adaptive selection tokens through token selection transformers; sentiment features flow through these smaller token sequences, capturing important information while reducing redundancy. Next, we introduce a token fusion transformer that fuses multimodal features dynamically, adaptively estimating each modality's unique contribution to sentiment tendency through learnable fusion tokens. Experiments on two benchmark datasets demonstrate that our approach achieves competitive performance with significant improvements.
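The two stages described above — compressing each modality's feature sequence into a few selection tokens, then fusing the compressed tokens across modalities via learnable fusion tokens — can be sketched with plain cross-attention. The following NumPy sketch is illustrative only: all dimensions, modality names, and random initializations are assumptions, and a real implementation would use trained transformer layers rather than single attention maps.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, Wq, Wk, Wv):
    """Queries attend over a context sequence (scaled dot-product attention)."""
    Q, K, V = queries @ Wq, context @ Wk, context @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
d = 16          # feature dimension (assumed)
n_select = 4    # selection tokens per modality, far fewer than sequence length
n_fuse = 2      # learnable fusion tokens

# Hypothetical unimodal feature sequences of different lengths
modalities = {m: rng.standard_normal((T, d))
              for m, T in [("text", 50), ("audio", 120), ("video", 80)]}

# Shared projections and per-modality selection tokens (random stand-ins
# for what would be trained parameters)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
sel_tokens = {m: rng.standard_normal((n_select, d)) for m in modalities}

# Stage 1: token selection -- each modality is compressed to n_select tokens
selected = {m: cross_attention(sel_tokens[m], feats, Wq, Wk, Wv)
            for m, feats in modalities.items()}

# Stage 2: token fusion -- fusion tokens attend over all selected tokens,
# weighting each modality's contribution dynamically
fuse_tokens = rng.standard_normal((n_fuse, d))
all_selected = np.concatenate(list(selected.values()), axis=0)  # (3*n_select, d)
fused = cross_attention(fuse_tokens, all_selected, Wq, Wk, Wv)  # (n_fuse, d)
```

The `fused` tokens would then feed a small classifier head for sentiment polarity; the key design point is that attention weights in both stages are data-dependent, so selection and fusion adapt to each input.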