MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts

ACL ARR 2024 June Submission 4223 Authors

16 Jun 2024 (modified: 02 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Advances in multimodal models have greatly improved how interactions relevant to various tasks are modeled. Today's models mainly focus on the correspondence between images and text, using this for tasks like image captioning and image-text retrieval. However, this covers only a subset of real-world interactions. Novel interactions, such as sarcasm expressed through opposing spoken words and gestures or figurative descriptions of images, remain challenging. In this paper, we introduce an approach to enhance multimodal models, which we call Multimodal Mixtures of Experts (MMoE). The key idea in MMoE is to train separate expert models for each type of interaction, such as redundancy present in both modalities, uniqueness in one modality, or varying degrees of synergy that emerge when both modalities are fused. On two multimodal sarcasm datasets, we obtain new state-of-the-art results. MMoE also provides the opportunity to design smaller specialized experts and improves the transparency of the modeling process.
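
The abstract describes training one expert per interaction type (redundancy, uniqueness, synergy) and combining them. The sketch below illustrates that general idea; the module names, feature dimensions, gating scheme, and class count are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a mixture of multimodal interaction experts:
# one expert per interaction type, combined by a learned softmax gate.
import torch
import torch.nn as nn


class InteractionExpert(nn.Module):
    """Small classifier head over concatenated text/image features (assumed design)."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, num_classes),
        )

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([text_feat, image_feat], dim=-1))


class MMoESketch(nn.Module):
    """Weights per-interaction experts with a gate conditioned on both modalities."""
    def __init__(self, dim: int = 512, num_classes: int = 2):
        super().__init__()
        # One expert per interaction type named in the abstract.
        self.experts = nn.ModuleDict({
            "redundancy": InteractionExpert(dim, num_classes),
            "uniqueness": InteractionExpert(dim, num_classes),
            "synergy": InteractionExpert(dim, num_classes),
        })
        self.gate = nn.Linear(2 * dim, len(self.experts))

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([text_feat, image_feat], dim=-1)
        weights = torch.softmax(self.gate(fused), dim=-1)        # (batch, n_experts)
        logits = torch.stack(
            [expert(text_feat, image_feat) for expert in self.experts.values()],
            dim=1,                                               # (batch, n_experts, classes)
        )
        return (weights.unsqueeze(-1) * logits).sum(dim=1)       # weighted mixture of experts


# Usage with pre-extracted features (e.g., from frozen text/image encoders):
# model = MMoESketch(dim=512, num_classes=2)
# out = model(torch.randn(4, 512), torch.randn(4, 512))   # (4, 2) class logits
```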
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodal machine learning; multimodal interaction; sarcasm detection
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 4223