Mixture of Multimodal Interaction Experts

Published: 02 Nov 2023, Last Modified: 18 Dec 2023, UniReps Poster
Keywords: multimodal machine learning; mixture of experts
Abstract: Multimodal machine learning, which studies the information and interactions across input modalities, has made significant advances in modeling the relationship between images and descriptive text. Yet this shared information is only one of the multimodal interactions that occur in the real world; another is sarcasm conveyed through conflicting utterances and gestures. Notably, current methods for capturing shared information often do not extend well to these more nuanced interactions. In fact, current models show particular weaknesses on disagreement and synergistic interactions, sometimes performing as low as 50% on binary classification. In this paper, we address this problem with a new approach called mixture of multimodal interaction experts. The method automatically classifies datapoints from unlabeled multimodal datasets by their interaction types, then employs a specialized model for each interaction type. In our experiments, this approach improves performance on these challenging interactions by more than 10%, leading to an overall increase of 2% on tasks such as sarcasm prediction. As a result, interaction quantification not only provides new insights for dataset analysis but also yields simple approaches to obtain state-of-the-art performance.
Track: Extended Abstract Track
Submission Number: 80
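
The routing idea in the abstract (label each datapoint by interaction type, then send it to a specialized model) can be illustrated with a minimal sketch. This is not the authors' implementation: the two-way split into "agreement" and "disagreement", the total-variation threshold, and the two toy experts are all assumptions made for illustration, and every function name here is hypothetical. The sketch only assumes that each modality already has a unimodal classifier producing class probabilities, so no task labels are needed to assign an interaction type.

```python
from typing import Callable, Dict, List, Sequence

Probs = Sequence[float]

# Hypothetical interaction labels. The abstract mentions disagreement and
# synergy but does not spell out a taxonomy; two types keep the sketch small.
AGREEMENT = "agreement"
DISAGREEMENT = "disagreement"


def total_variation(p: Probs, q: Probs) -> float:
    """Total variation distance between two class distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def interaction_type(p_a: Probs, p_b: Probs, threshold: float = 0.5) -> str:
    """Assign an interaction type from unimodal predictions only (no labels).

    Assumption: large divergence between the two unimodal predictions is
    treated as a disagreement interaction.
    """
    if total_variation(p_a, p_b) > threshold:
        return DISAGREEMENT
    return AGREEMENT


def argmax(p: Probs) -> int:
    return max(range(len(p)), key=lambda i: p[i])


def agreement_expert(p_a: Probs, p_b: Probs) -> int:
    """Toy expert: average the two distributions and take the top class."""
    fused: List[float] = [(a + b) / 2 for a, b in zip(p_a, p_b)]
    return argmax(fused)


def disagreement_expert(p_a: Probs, p_b: Probs) -> int:
    """Toy expert: follow whichever modality is more confident."""
    return argmax(p_a) if max(p_a) >= max(p_b) else argmax(p_b)


Expert = Callable[[Probs, Probs], int]

EXPERTS: Dict[str, Expert] = {
    AGREEMENT: agreement_expert,
    DISAGREEMENT: disagreement_expert,
}


def predict(p_a: Probs, p_b: Probs, threshold: float = 0.5) -> int:
    """Route a datapoint to the expert for its interaction type."""
    return EXPERTS[interaction_type(p_a, p_b, threshold)](p_a, p_b)
```

In a real system each expert would be a trained multimodal model rather than a fixed fusion rule; the point of the sketch is only the two-stage structure of interaction typing followed by per-type dispatch.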