Towards Holistic Multimodal Interaction: An Information-Theoretic Perspective

25 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Multimodal learning, Information theory, Multimodal interaction
TL;DR: We analyze the influence of multimodal interaction on multimodal learning, and propose a decomposition-based framework to distinguish interaction types and guide learning.
Abstract: Multimodal interaction, which assesses whether information originates from individual modalities or from their integration, is a critical property of multimodal data. The type of interaction varies across tasks and subtly influences the effectiveness of multimodal learning, yet it remains an underexplored topic. In this paper, we present an information-theoretic analysis of how interactions affect multimodal learning. We formulate specific types of information-theoretic interactions and provide theoretical evidence that an effective multimodal model necessitates comprehensive learning across all interaction types. Moreover, we analyze two typical multimodal learning paradigms (joint learning and modality ensemble) and demonstrate that both exhibit generalization gaps when faced with certain types of interactions. This observation underscores the need for a new paradigm that can isolate and enhance each type of interaction. To address this challenge, we propose the Decomposition-based Multimodal Interaction learning (DMI) paradigm. Our approach uses variation-based decomposition modules to segregate multimodal information into distinct, disentangled interaction types. A new training strategy is then developed to holistically enhance learning efficacy across these interaction types. Comprehensive empirical results indicate that DMI enhances multimodal learning by effectively decomposing interactions and improving their learning in a targeted manner.
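To make the notion of interaction types concrete, the following toy sketch (not the paper's DMI method, just a standard mutual-information illustration) shows a purely synergistic case: with two binary modalities and an XOR label, each modality alone carries zero information about the label, while the two together determine it completely. The helper `mutual_info` and the sample construction are our own illustrative choices.

```python
from collections import Counter
from math import log2
import itertools

def mutual_info(pairs):
    """Estimate I(X;Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)                 # joint counts
    px = Counter(x for x, _ in pairs)    # marginal counts of x
    py = Counter(y for _, y in pairs)    # marginal counts of y
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Two binary modalities, all four combinations equally likely.
samples = list(itertools.product([0, 1], [0, 1]))

# Synergistic label: XOR is invisible to each modality in isolation.
xor = [((x1, x2), x1 ^ x2) for x1, x2 in samples]

print(mutual_info([(x1, y) for (x1, _), y in xor]))  # modality 1 alone: 0.0 bits
print(mutual_info([(xy, y) for xy, y in xor]))       # both jointly: 1.0 bit
```

A label equal to one modality's value would instead show the opposite pattern (all information unique to that modality), which is why a model that captures only one kind of interaction can generalize poorly on tasks dominated by another.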
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4463
