Multimodal Contextual Interactions of Entities: A Modality Circular Fusion Approach for Link Prediction
Abstract: Link prediction aims to infer missing valid triplets to complete knowledge graphs, and recent work incorporates multimodal information to enrich entity representations. Existing methods either project multimodal information into a unified embedding space or learn modality-specific features separately and integrate them later. However, their performance is limited because they neglect the compatible and conflicting modality semantics carried by entities in valid and invalid triplets. In this paper, we aim to model inter-entity modality interactions and propose a novel modality circular fusion approach (MoCi) that interweaves the multimodal contexts of entities. First, unlike most methods for this task that fuse modalities directly, we design a triplet-prompt modality contrastive pre-training to align modality semantics beforehand. Moreover, we propose a modality circular fusion model based on a simple yet efficient multilinear transformation strategy. This enables explicit inter-entity modality interactions, distinguishing MoCi from methods confined to fusion within individual entities. To the best of our knowledge, MoCi is among the first frameworks tailored to capture inter-entity modality semantics for better link prediction. Extensive experiments on seven datasets show that MoCi achieves state-of-the-art performance, confirming its efficacy in modeling inter-entity modality interactions. Our code is released at https://github.com/MoCiGitHub/MoCi.
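The abstract gives no implementation details, so the following is only a minimal PyTorch sketch of what a circular, multilinear fusion over an entity's modality embeddings might look like. The class name CircularFusion, the fixed cyclic pairing order, the use of nn.Bilinear as the multilinear map, and the summation into a joint vector are all illustrative assumptions, not the paper's actual method; the authors' implementation is in the linked repository.

```python
import torch
import torch.nn as nn

class CircularFusion(nn.Module):
    """Hypothetical sketch of a modality circular fusion layer.

    Assumed reading of the abstract: each modality embedding is fused
    with the next modality in a fixed cyclic order via a bilinear
    (multilinear) map, so every modality interacts with every other
    after a full cycle. The real MoCi layer may differ; see
    https://github.com/MoCiGitHub/MoCi for the released code.
    """

    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        # One bilinear map per edge of the modality cycle (assumption).
        self.bilinear = nn.ModuleList(
            nn.Bilinear(dim, dim, dim) for _ in range(num_modalities)
        )

    def forward(self, modalities: list[torch.Tensor]) -> torch.Tensor:
        # modalities: list of (batch, dim) tensors,
        # e.g. [structural, textual, visual] embeddings of entities.
        m = len(modalities)
        fused = []
        for i in range(m):
            nxt = modalities[(i + 1) % m]  # circular neighbour
            fused.append(self.bilinear[i](modalities[i], nxt))
        # Sum the pairwise interactions into one joint representation.
        return torch.stack(fused, dim=0).sum(dim=0)

# Usage: fuse structural, textual, and visual embeddings of a batch of entities.
dim = 128
layer = CircularFusion(dim)
s, t, v = (torch.randn(4, dim) for _ in range(3))
joint = layer([s, t, v])  # shape: (4, 128)
```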
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Media Interpretation
Relevance To Conference: This work contributes to the advancement of multimedia/multimodal processing by proposing a novel modality circular fusion approach.
First, we propose a multimodal fusion method for link prediction. By integrating multimodal data (such as text and images) across different entities, it aims to improve both the accuracy and efficiency of link prediction.
Second, our method is flexible in its choice of modality data and can be extended to further multimedia data, such as video and audio, thereby further enriching entity representations.
Finally, our work builds a bridge between knowledge graph completion and the fields of multimedia and multimodal processing, providing a novel perspective and methodology.
Submission Number: 5668