Multimodal Graph-based Variational Mixture of Experts Network for Zero-shot Multimodal Information Extraction

Published: 29 Jan 2025, Last Modified: 29 Jan 2025, WWW 2025 Poster, CC BY 4.0
Track: Web mining and content analysis
Keywords: multimodal information extraction, zero-shot learning, multimodal representation learning
Abstract: Multimodal information extraction on social media comprises a series of fundamental tasks for constructing multimodal knowledge graphs. These tasks extract structural information from free text together with the accompanying images, e.g., multimodal named entity typing and multimodal relation extraction. However, the growing volume of multimodal data implies a growing category set, and newly emerging entity types or relations should be recognized without additional training. To address these challenges, we focus on the zero-shot multimodal information extraction task, which requires utilizing the textual and visual modalities to identify previously unseen categories in a zero-shot manner. Compared with text-based zero-shot information extraction models, existing multimodal ones align the textual and visual modalities directly and exploit various fusion strategies to improve generalization. However, these methods align only the global representations of multimodal data and ignore the fine-grained semantic correlations within text-image pairs and across samples. Therefore, we propose the multimodal graph-based variational mixture of experts network (MG-VMoE), which takes an MoE network as the backbone and exploits sparse expert weights to align multimodal representations in a fine-grained way. To learn informative and aligned representations of multimodal data, we design each expert network as a variational information bottleneck that processes both modalities in a single backbone. Moreover, we not only model the correlation of the text-image pair within a sample, but also propose multimodal graph-based virtual adversarial training to learn the semantic correlations between samples. Experimental results on two benchmark datasets demonstrate the superiority of MG-VMoE over the baselines.
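As a rough illustration of the two core ideas in the abstract (sparse expert routing plus variational-information-bottleneck experts), the following numpy sketch combines a top-k gated mixture of experts with reparameterized Gaussian experts. This is a hypothetical toy, not the authors' MG-VMoE implementation: all shapes, weight initializations, and names (`VIBExpert`, `VMoE`, `d_text`, `d_image`) are assumptions made here for illustration only, and the multimodal graph and virtual adversarial training components are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class VIBExpert:
    """One expert as a variational information bottleneck: maps the fused
    feature to a Gaussian posterior q(z|x) and samples z via the
    reparameterization trick. (Hypothetical sketch.)"""
    def __init__(self, d_in, d_z):
        self.W_mu = rng.normal(0, 0.1, (d_in, d_z))
        self.W_logvar = rng.normal(0, 0.1, (d_in, d_z))

    def __call__(self, x):
        mu = x @ self.W_mu
        logvar = x @ self.W_logvar
        eps = rng.standard_normal(mu.shape)
        z = mu + np.exp(0.5 * logvar) * eps  # reparameterization trick
        # KL(q(z|x) || N(0, I)) per sample, summed over latent dims
        kl = 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar).sum(axis=-1)
        return z, kl

class VMoE:
    """Sparse top-k mixture of VIB experts over concatenated text+image
    features; each expert sees both modalities in one backbone."""
    def __init__(self, d_text, d_image, d_z, n_experts=4, k=2):
        d_in = d_text + d_image
        self.experts = [VIBExpert(d_in, d_z) for _ in range(n_experts)]
        self.W_gate = rng.normal(0, 0.1, (d_in, n_experts))
        self.k = k

    def __call__(self, text_feat, image_feat):
        x = np.concatenate([text_feat, image_feat], axis=-1)
        logits = x @ self.W_gate                      # (batch, n_experts)
        # sparse routing: keep only the top-k experts per sample
        topk = np.argsort(logits, axis=-1)[:, -self.k:]
        masked = np.full_like(logits, -np.inf)
        np.put_along_axis(masked, topk,
                          np.take_along_axis(logits, topk, axis=-1), axis=-1)
        gates = softmax(masked)                       # zero off the top-k
        zs, kls = zip(*(e(x) for e in self.experts))
        z = sum(g[:, None] * zi for g, zi in zip(gates.T, zs))
        kl = sum(g * k_ for g, k_ in zip(gates.T, kls))
        return z, kl, gates
```

A usage example: `moe = VMoE(d_text=8, d_image=8, d_z=4)` followed by `z, kl, gates = moe(text_batch, image_batch)` yields a latent of shape `(batch, 4)`, a per-sample KL penalty for the bottleneck objective, and gate weights with at most `k` nonzero entries per row.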
Submission Number: 308