High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning

Published: 30 May 2023, Last Modified: 30 May 2023
Accepted by TMLR
Abstract: Many real-world problems are inherently multimodal, from the communicative modalities humans use to express social and emotional states such as spoken language, gestures, and paralinguistics to the force, proprioception, and visual sensors ubiquitous on robots. While there has been an explosion of interest in multimodal representation learning, these methods are still largely focused on a small set of modalities, primarily in the language, vision, and audio space. In order to accelerate generalization towards diverse and understudied modalities, this paper studies efficient representation learning for high-modality scenarios involving a large set of diverse modalities. Since adding new models for every new modality or task becomes prohibitively expensive, a critical technical challenge is heterogeneity quantification: how can we measure which modalities encode similar information and interactions in order to permit parameter sharing with previous modalities? This paper proposes two new information theoretic metrics for heterogeneity quantification: (1) modality heterogeneity studies how similar $2$ modalities $\{X_1,X_2\}$ are by measuring how much information can be transferred from $X_1$ to $X_2$, while (2) interaction heterogeneity studies how similarly pairs of modalities $\{X_1,X_2\}, \{X_3,X_4\}$ interact by measuring how much interaction information can be transferred from $\{X_1,X_2\}$ to $\{X_3,X_4\}$. We show the importance of these $2$ proposed metrics in high-modality scenarios as a way to automatically prioritize the fusion of modalities that contain unique information or unique interactions. The result is a single model, HighMMT, that scales up to $10$ modalities (text, image, audio, video, sensors, proprioception, speech, time-series, sets, and tables) and $15$ tasks from $5$ different research areas. Not only does HighMMT outperform prior methods on the tradeoff between performance and efficiency, but it also demonstrates a crucial scaling behavior: performance continues to improve with each modality added, and it transfers to entirely new modalities and tasks during fine-tuning. We release our code and benchmarks, which we hope will present a unified platform for subsequent theoretical and empirical analysis.
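To make the transfer-based intuition behind the modality-heterogeneity metric concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' released implementation; see the code link below for that). It approximates the heterogeneity between modalities $X_1$ and $X_2$ as the drop in task performance when an encoder trained on $X_1$ is reused for $X_2$ (fitting only a new linear head), relative to an encoder trained on $X_2$ directly. The function name `modality_heterogeneity`, the toy hyperparameters, and the assumption that both modalities are already embedded into a common feature dimension are all illustrative choices.

```python
# Hypothetical sketch of a transfer-based heterogeneity proxy.
# Assumption: X1 and X2 are already embedded to the same feature dimension
# (e.g., by modality-specific embedding layers); all hyperparameters are toy values.
import torch
import torch.nn as nn

def train(encoder, head, X, y, epochs=200, lr=1e-2, freeze_encoder=False):
    # Optimize the head (and optionally the encoder) with cross-entropy.
    params = list(head.parameters()) if freeze_encoder else \
             list(encoder.parameters()) + list(head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(head(encoder(X)), y)
        loss.backward()
        opt.step()

def accuracy(encoder, head, X, y):
    with torch.no_grad():
        return (head(encoder(X)).argmax(-1) == y).float().mean().item()

def modality_heterogeneity(X1, y1, X2, y2, hidden=32, num_classes=2):
    # 1) Train an encoder on modality X1's task.
    enc1 = nn.Sequential(nn.Linear(X1.shape[1], hidden), nn.ReLU())
    head1 = nn.Linear(hidden, num_classes)
    train(enc1, head1, X1, y1)

    # 2) Transfer: reuse enc1 for modality X2, fitting only a new linear head.
    head_t = nn.Linear(hidden, num_classes)
    train(enc1, head_t, X2, y2, freeze_encoder=True)
    acc_transfer = accuracy(enc1, head_t, X2, y2)

    # 3) Reference: an encoder trained directly on modality X2.
    enc2 = nn.Sequential(nn.Linear(X2.shape[1], hidden), nn.ReLU())
    head2 = nn.Linear(hidden, num_classes)
    train(enc2, head2, X2, y2)
    acc_direct = accuracy(enc2, head2, X2, y2)

    # A larger performance gap suggests the modalities share less usable
    # information, i.e., they are more heterogeneous.
    return max(0.0, acc_direct - acc_transfer)
```

Interaction heterogeneity can be sketched analogously by replacing the single-modality encoders with small fusion models over modality pairs $\{X_1,X_2\}$ and $\{X_3,X_4\}$ and measuring how well the fused representation transfers between the two pairs; modality (or pair) groupings with low heterogeneity are then candidates for parameter sharing.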
Submission Length: Regular submission (no more than 12 pages of main content)
Code: https://github.com/pliang279/HighMMT
Supplementary Material: zip
Assigned Action Editor: ~Brian_Kingsbury1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 694