MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning

Thanh-Dat Truong; Christophe Bobda; Nitin Agarwal; Khoa Luu

Published: 18 Sept 2025, Last Modified: 21 Apr 2026

Venue: NeurIPS 2025 poster

License: CC BY 4.0

Keywords: Multimodal Fusion, Multimodal Learning, Normalizing Flows

TL;DR: This paper introduces a novel Multimodal Attention-based Normalizing Flow (MANGO) approach for explicit, interpretable, and tractable multimodal fusion learning.

Abstract: Multimodal learning has achieved considerable success in recent years. However, current multimodal fusion methods rely on the attention mechanism of Transformers to learn the underlying correlations of multimodal features only implicitly. As a result, the multimodal model cannot capture the essential features of each modality, which makes it difficult to comprehend the complex structures and correlations of multimodal inputs. This paper introduces a novel Multimodal Attention-based Normalizing Flow (MANGO) approach for explicit, interpretable, and tractable multimodal fusion learning. In particular, we propose a new Invertible Cross-Attention (ICA) layer for building a normalizing flow-based model for multimodal data. To efficiently capture the complex, underlying correlations in multimodal data within the proposed invertible cross-attention layer, we propose three new cross-attention mechanisms: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA). Finally, we introduce a new Multimodal Attention-based Normalizing Flow that scales the proposed method to high-dimensional multimodal data. Experimental results on three multimodal learning tasks, i.e., semantic segmentation, image-to-image translation, and movie genre classification, demonstrate the state-of-the-art (SoTA) performance of the proposed approach.
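
For context, the "tractable" property claimed for normalizing flows generally comes from the standard change-of-variables identity, sketched below. The notation ($x$, $z$, $h_k$, $f_k$, $K$) is assumed here for illustration and is not taken from the paper; the identity holds for any stack of invertible layers and is not specific to the proposed ICA layer.

```latex
% Generic normalizing-flow likelihood (illustrative notation, not the paper's).
% A flow composes K invertible layers: h_0 = x, h_k = f_k(h_{k-1}), z = h_K.
\begin{align}
  z &= (f_K \circ \cdots \circ f_1)(x), \\
  \log p_X(x) &= \log p_Z(z)
    + \sum_{k=1}^{K} \log \left| \det \frac{\partial h_k}{\partial h_{k-1}} \right| .
\end{align}
```

Each layer therefore has to be invertible and has to admit an efficiently computable Jacobian determinant, which is the constraint any attention-based layer inside a flow must satisfy.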

Supplementary Material: zip

Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)

Submission Number: 19767
