InMu-Net: Advancing Multi-modal Intent Detection via Information Bottleneck and Multi-sensory Processing

Published: 20 Jul 2024 · Last Modified: 05 Aug 2024 · MM 2024 Oral · CC BY 4.0
Abstract: Multi-modal intent detection (MID) aims to comprehend users' intentions through diverse modalities and has received widespread attention in dialogue systems. Despite promising advances in complex fusion mechanisms and architecture designs, challenges remain due to (1) various noise and redundancy in the visual and audio modalities and (2) the long-tailed distribution of intent categories. In this paper, to tackle these two issues, we propose InMu-Net, a simple yet effective framework for MID from the $\textbf{In}$formation bottleneck and $\textbf{Mu}$lti-sensory processing perspective. Our contributions lie in three aspects. First, we devise a $\textit{denoising bottleneck module}$ to filter out intent-irrelevant information in the fused feature. Second, we introduce a $\textit{saliency preservation loss}$ to prevent intent-relevant information from being dropped. Finally, $\textit{kurtosis regulation}$ is introduced to maintain representation smoothness during filtering, mitigating the adverse impact of the long-tailed distribution. Comprehensive experiments on two MID benchmark datasets demonstrate the effectiveness of InMu-Net and its vital components. Impressively, a series of analyses reveal its denoising capability and robustness in low-resource, modality-corruption, cross-architecture, and cross-task scenarios.
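The abstract does not spell out the exact formulations of the three components, so the following is only an illustrative sketch of how such pieces are commonly realized: a variational-IB-style bottleneck over the fused feature and a kurtosis penalty pulling features toward a Gaussian-like target. All class/function names (`DenoisingBottleneck`, `kurtosis_regulation_loss`) and the target value 3.0 are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch, not the authors' code: names, shapes, and the
# Gaussian-kurtosis target are assumptions made for illustration only.
import torch
import torch.nn as nn


def kurtosis(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Per-sample kurtosis of feature vectors with shape (batch, dim)."""
    centered = x - x.mean(dim=-1, keepdim=True)
    var = centered.pow(2).mean(dim=-1)
    fourth = centered.pow(4).mean(dim=-1)
    return fourth / (var.pow(2) + eps)


def kurtosis_regulation_loss(z: torch.Tensor, target: float = 3.0) -> torch.Tensor:
    """Penalize heavy-tailed feature distributions by pulling kurtosis toward
    a Gaussian-like target (3.0 is an assumed choice, not from the paper)."""
    return (kurtosis(z) - target).pow(2).mean()


class DenoisingBottleneck(nn.Module):
    """Variational-IB-style filter: compress the fused multi-modal feature
    into a stochastic code z, trading task loss against a KL information cost."""

    def __init__(self, in_dim: int, z_dim: int):
        super().__init__()
        self.mu_head = nn.Linear(in_dim, z_dim)
        self.logvar_head = nn.Linear(in_dim, z_dim)

    def forward(self, fused: torch.Tensor):
        mu, logvar = self.mu_head(fused), self.logvar_head(fused)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1).mean()
        return z, kl
```

In such a setup, the KL term plays the role of the compression pressure that discards intent-irrelevant content, while the kurtosis penalty keeps the filtered representation smooth; a task-specific term (e.g., the saliency preservation loss described above) would be added to retain intent-relevant information.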
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: This work is highly relevant to the multimedia and multimodal processing fields, as it presents a novel framework, from the information bottleneck and multi-sensory processing perspectives, that jointly tackles modality redundancy and the long-tailed label distribution in multi-modal intent detection. Extensive experiments across low-resource, modality-corruption, cross-architecture, and cross-task scenarios demonstrate the superiority of our framework. In summary, our work aligns with the conference's focus on multimedia/multimodal research.
Supplementary Material: zip
Submission Number: 5105