Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling

Ming Hou, Jiajia Tang, Jianhai Zhang, Wanzeng Kong, Qibin Zhao

06 Sept 2019 (modified: 05 May 2023) · NeurIPS 2019 · Readers: Everyone
Abstract: Multimodal fusion is at the core of multimodal research, with the objective of producing better multimodal feature representations from heterogeneous data modalities. Tensor-based fusion techniques have shown great potential for boosting the performance of multimodal prediction. Despite being simple and compact, existing tensor fusion approaches consider only bilinear or trilinear pooling, which is insufficient to fully capture the complicated correlations among modalities: with restricted orders of interactions, those models fail to unleash the complete expressive power of multilinear fusion. More importantly, they simply fuse the multimodal features all at once in a global manner, so the complex local dynamics of interactions cannot be grasped, leading to significant deterioration of the prediction. In this work, we first propose a high-order polynomial multilinear pooling unit as a local fusion block. Building upon this, we establish a deep multimodal fusion architecture that can flexibly fuse mixed features across both the temporal and modality domains. The proposed model reveals much more complex temporal-modality correlations at both local and global scales, and is shown to possess expressive capacity equivalent to that of very deep convolutional arithmetic circuits. Experiments on real multimodal datasets demonstrate its state-of-the-art predictive performance.
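The core idea of the pooling unit in the abstract can be sketched as follows. This is a hedged, minimal NumPy illustration (not the authors' implementation; all function and parameter names here, such as `ptp`, `order`, and `rank`, are hypothetical): modality features are concatenated, a constant 1 is appended so that interactions of every order up to the target degree are retained, and the degree-P interaction tensor is approximated by a low-rank CP contraction rather than materialized.

```python
import numpy as np

rng = np.random.default_rng(0)

def ptp(features, order=3, rank=8, out_dim=4):
    """Sketch of a high-order polynomial pooling unit (hypothetical names).

    Concatenates modality features, appends a constant 1 (so lower-order
    terms survive), then contracts a rank-`rank` CP approximation of the
    degree-`order` polynomial interaction tensor. Weights are random here
    purely for illustration; in a real model they would be learned.
    """
    z = np.concatenate([np.ravel(f) for f in features])
    z = np.concatenate([[1.0], z])  # constant slot keeps lower-order interactions
    d = z.shape[0]
    # One factor matrix per polynomial degree, plus an output projection.
    factors = [rng.standard_normal((rank, d)) for _ in range(order)]
    proj = rng.standard_normal((out_dim, rank))
    # Elementwise product over degrees realizes the rank-R CP contraction
    # of the order-P outer product of z with itself, without forming it.
    h = np.ones(rank)
    for W in factors:
        h = h * (W @ z)
    return proj @ h

# Toy modalities of different dimensionalities (e.g. audio, video, text).
audio = rng.standard_normal(5)
video = rng.standard_normal(6)
text = rng.standard_normal(7)
fused = ptp([audio, video, text])
print(fused.shape)  # (4,)
```

The key point of this sketch is cost: the full degree-3 interaction tensor over a 19-dimensional input would have 19^3 entries, while the CP-factored form uses only 3 factor matrices of shape (8, 19) plus the projection.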
Code Link: https://github.com/jiajiatang0000/HPFN
CMT Num: 6542