Towards Multimodal-augmented Pre-trained Language Models via Self-balanced Expectation-Maximization Iteration

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM 2024 Poster, CC BY 4.0
Abstract: Pre-trained language models (PLMs) that rely solely on textual corpora may be limited in comprehending multimodal semantics. Existing studies attempt to alleviate this issue by incorporating additional modal information through image retrieval or generation. However, these methods: (1) inevitably encounter modality gaps and noise; (2) treat all modalities indiscriminately; and (3) ignore the visual or acoustic semantics of key entities. To tackle these challenges, we propose a novel principled iterative framework for multimodal-augmented PLMs, termed MASE, which achieves efficient and balanced injection of multimodal semantics under the proposed Expectation-Maximization (EM)-based iterative algorithm. Initially, MASE utilizes multimodal proxies instead of explicit data to enhance PLMs, which avoids noise and modality gaps. In the E-step, MASE adopts a novel information-driven self-balanced strategy to estimate allocation weights. Furthermore, MASE employs heterogeneous graph attention to capture entity-level fine-grained semantics on the proposed multimodal-semantic scene graph. In the M-step, MASE injects global multimodal knowledge into PLMs through a cross-modal contrastive loss. Experimental results show that MASE consistently outperforms competitive baselines on multiple tasks across various architectures. More impressively, MASE is compatible with existing parameter-efficient fine-tuning methods, such as prompt learning.
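The abstract outlines an alternation between an E-step (estimating per-modality allocation weights) and an M-step (a weighted cross-modal contrastive loss). Below is a minimal PyTorch sketch of such a loop; the similarity-softmax weighting, the InfoNCE-style loss, and all tensor shapes are illustrative assumptions for exposition, not the paper's actual formulation.

# Minimal sketch of a self-balanced EM iteration for multimodal-augmented PLMs.
# All function bodies are illustrative stand-ins, not the authors' implementation.
import torch
import torch.nn.functional as F

def e_step_allocation_weights(text_emb, proxy_embs):
    # E-step (illustrative): one allocation weight per modality, here a
    # softmax over mean text-proxy cosine similarities as a stand-in for
    # the paper's information-driven self-balanced strategy.
    sims = torch.stack([
        F.cosine_similarity(text_emb, p, dim=-1).mean() for p in proxy_embs
    ])
    return F.softmax(sims, dim=0)

def m_step_contrastive_loss(text_emb, proxy_embs, weights, temperature=0.07):
    # M-step (illustrative): weighted InfoNCE-style cross-modal loss.
    # Matched text/proxy pairs (same batch index) are positives; all
    # other pairs in the batch serve as negatives.
    loss = 0.0
    for w, proxy in zip(weights, proxy_embs):
        t = F.normalize(text_emb, dim=-1)
        p = F.normalize(proxy, dim=-1)
        logits = t @ p.t() / temperature      # (B, B) similarity matrix
        labels = torch.arange(t.size(0))      # diagonal entries are positives
        loss = loss + w * F.cross_entropy(logits, labels)
    return loss

# Toy usage: a batch of 8 text embeddings and three modality proxies
# (image, video, audio), each 256-d. Real proxies would come from frozen
# multimodal encoders rather than random tensors.
B, D = 8, 256
text_emb = torch.randn(B, D, requires_grad=True)
proxies = [torch.randn(B, D) for _ in range(3)]

for step in range(3):                                           # EM iterations
    with torch.no_grad():
        weights = e_step_allocation_weights(text_emb, proxies)  # E-step
    loss = m_step_contrastive_loss(text_emb, proxies, weights)  # M-step
    loss.backward()
    with torch.no_grad():
        text_emb -= 0.1 * text_emb.grad   # toy SGD update of the text side
        text_emb.grad.zero_()
    print(f"iter {step}: weights={weights.tolist()}, loss={loss.item():.3f}")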
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Engagement] Summarization, Analytics, and Storytelling
Relevance To Conference: We present an approach that alleviates the inherent limitations of text-only pre-trained language models (PLMs) in comprehending multimodal semantics. Our work is highly relevant to the multimedia and multimodal processing field, as it introduces a novel framework, MASE, designed to inject multimodal semantics into PLMs efficiently and in a balanced manner. MASE stands out by utilizing multimodal proxies (spanning image, video, and audio) to enhance PLMs under our iterative framework. This strategy aligns with the MM community's emphasis on promoting research that is inherently multimedia or multimodal. Our experimental results demonstrate that MASE consistently outperforms competitive baselines across a variety of tasks, showcasing its strength in handling diverse multimedia and multimodal settings. Moreover, MASE's compatibility with parameter-efficient fine-tuning methods, such as prompt learning, further enhances its applicability and relevance to the MM community. In sum, our work adheres to the conference's focus on multimedia and multimodal research.
Supplementary Material: zip
Submission Number: 3692