Multimodal Entity Linking with Gated Hierarchical Fusion and Contrastive Training

Published: 01 Jan 2022, Last Modified: 16 Aug 2024 · SIGIR 2022 · CC BY-SA 4.0
Abstract: Previous entity linking methods for knowledge graphs (KGs) mostly link textual mentions to their corresponding entities. However, they struggle with multimodal data, where the text is often too short to provide enough context. We therefore propose to introduce valuable information from other modalities and present a novel multimodal entity linking method with gated hierarchical multimodal fusion and contrastive training (GHMFC). First, to discover fine-grained inter-modal correlations, GHMFC extracts hierarchical textual and visual co-attention features through a multimodal co-attention mechanism: textual-guided visual attention and visual-guided textual attention. The former obtains weighted visual features under the guidance of textual information, while the latter produces weighted textual features under the guidance of visual information. Gated fusion then evaluates the importance of the hierarchical features of the different modalities and integrates them into the final multimodal representations of mentions. Next, contrastive training with two types of contrastive losses is designed to learn more generic multimodal features and reduce noise. Finally, linked entities are selected by computing the cosine similarity between the representations of mentions and of entities in the KG. To evaluate the proposed method, this paper releases two new open multimodal entity linking datasets, WikiMEL and Richpedia-MEL. Experimental results demonstrate that GHMFC learns meaningful multimodal representations and significantly outperforms most baseline methods.
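
The sketch below is a minimal PyTorch-style illustration of the pipeline the abstract describes: co-attention in both directions, gated fusion of the two attended streams, a contrastive loss over mention and entity embeddings, and cosine-similarity candidate selection. All module names, shapes, pooling choices, and the single InfoNCE-style loss shown here are assumptions for illustration only; the paper's actual architecture and its two contrastive losses are described in the full text, not reproduced here.

```python
# Illustrative sketch only; names such as GatedCoAttentionFusion and d_model
# are assumptions, not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedCoAttentionFusion(nn.Module):
    """Co-attention between textual and visual features, followed by a gate
    that weighs the two attended streams before fusing them."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        # Textual-guided visual attention: text queries attend over image regions.
        self.text_to_vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Visual-guided textual attention: image queries attend over text tokens.
        self.vis_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gate scoring the importance of each modality's attended features.
        self.gate = nn.Linear(2 * d_model, 2)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, text_feats, vis_feats):
        # text_feats: (B, L_t, d), vis_feats: (B, L_v, d)
        attended_vis, _ = self.text_to_vis(text_feats, vis_feats, vis_feats)
        attended_text, _ = self.vis_to_text(vis_feats, text_feats, text_feats)

        # Pool each attended stream into one vector per example.
        v = attended_vis.mean(dim=1)
        t = attended_text.mean(dim=1)

        # Gated fusion: softmax weights decide each modality's contribution.
        weights = F.softmax(self.gate(torch.cat([t, v], dim=-1)), dim=-1)
        fused = weights[:, 0:1] * t + weights[:, 1:2] * v
        return self.proj(fused)


def contrastive_loss(mention_emb, entity_emb, temperature=0.07):
    """One InfoNCE-style contrastive loss: matching mention/entity pairs are
    positives, other entities in the batch serve as in-batch negatives."""
    mention_emb = F.normalize(mention_emb, dim=-1)
    entity_emb = F.normalize(entity_emb, dim=-1)
    logits = mention_emb @ entity_emb.t() / temperature  # scaled cosine similarities
    targets = torch.arange(mention_emb.size(0), device=mention_emb.device)
    return F.cross_entropy(logits, targets)


def link_entities(mention_emb, candidate_embs):
    """Select, for each mention, the candidate entity with the highest cosine similarity."""
    sims = F.cosine_similarity(mention_emb.unsqueeze(1), candidate_embs, dim=-1)
    return sims.argmax(dim=-1)
```

In this reading, the gate plays the role the abstract assigns to gated fusion (weighting the hierarchical features of each modality), while entity representations would be produced by an analogous encoder over KG-side text and images before the contrastive loss and cosine-similarity ranking are applied.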