A Multilevel Interaction Network Framework for Multimodal Entity Linking

Xiaoyu Jia, Minghua Nuo, Yao Wang, Yuan Zhang

Published: 2024, Last Modified: 24 Apr 2026. Venue: NLPCC (3) 2024. License: CC BY-SA 4.0

Abstract: Multimodal Entity Linking (MEL) aims to link ambiguous mentions in multimodal contexts to entities in a multimodal knowledge graph. However, simple inter-modal interactions alone may be insufficient for processing multimodal data. To address this, we propose a novel Multilevel Interaction Network Framework (MINF) for MEL that comprehensively explores both intra-modal and inter-modal interaction and integration. To capture fine-grained cues within individual modalities, we design the Text to Text Interaction Unit (TTTU) and the Image to Image Interaction Unit (ITIU). To model semantic correlations between different modalities, we introduce the Text to Image Fusion Interaction Unit (TIFU) and the Text to Image Cross Interaction Unit (TICU), which enhance multimodal feature interactions between entities and mentions. We also introduce independent loss functions for specific units to improve multimodal learning while preventing over-reliance on any single unit. Experimental results on three public benchmark datasets show that the proposed framework outperforms several state-of-the-art baseline methods, and ablation studies verify the effectiveness of the designed modules.
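The idea of giving individual units their own loss can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the loss form, how unit scores are combined, or any weighting, so the sketch assumes each unit produces one matching score per candidate entity, that the joint score is a plain average of the unit scores, and that every loss is a softmax cross-entropy with equal weight. The function names `cross_entropy` and `multilevel_loss` are hypothetical.

```python
import math


def cross_entropy(scores, gold_index):
    """Softmax cross-entropy over candidate-entity scores for one mention."""
    max_score = max(scores)  # subtract the max for numerical stability
    log_sum = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_sum - scores[gold_index]


def multilevel_loss(unit_scores, gold_index):
    """Combine a joint loss with one independent loss per interaction unit.

    unit_scores maps a unit name (e.g. "TTTU") to a list holding one score
    per candidate entity. All lists are assumed to have the same length.
    """
    names = sorted(unit_scores)
    num_candidates = len(unit_scores[names[0]])

    # Assumption: the joint score is the unweighted mean of the unit scores.
    joint = [
        sum(unit_scores[name][i] for name in names) / len(names)
        for i in range(num_candidates)
    ]

    # Each unit is supervised on its own, so no single unit can carry the
    # whole prediction while the others stay untrained.
    per_unit = {name: cross_entropy(unit_scores[name], gold_index) for name in names}
    joint_loss = cross_entropy(joint, gold_index)

    # Assumption: equal weights for the joint loss and every unit loss.
    total = joint_loss + sum(per_unit.values())
    return total, joint_loss, per_unit


if __name__ == "__main__":
    # Illustrative scores for three candidate entities; candidate 0 is gold.
    example = {
        "TTTU": [2.0, 0.5, 0.1],
        "ITIU": [0.3, 0.4, 0.2],
        "TIFU": [1.5, 0.2, 0.9],
        "TICU": [1.1, 1.0, 0.3],
    }
    total, joint_loss, per_unit = multilevel_loss(example, gold_index=0)
    print(f"total={total:.4f} joint={joint_loss:.4f}")
    for name, value in per_unit.items():
        print(f"{name}: {value:.4f}")
```

In this sketch a unit whose scores barely separate the gold entity from the others (here the illustrative `ITIU` values) still receives its own gradient signal through its separate loss term, even when the averaged joint score already ranks the gold entity first.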