Multimodal Entity Linking With Dynamic Modality Selection and Interactive Prompt Learning

Published: 2025 · Last Modified: 08 Jan 2026 · IEEE Trans. Knowl. Data Eng. 2025 · CC BY-SA 4.0
Abstract: Recent advances in Multimodal Entity Linking leverage multimodal information to link target mentions to their corresponding entities. However, existing methods adopt a uniform "one-size-fits-all" approach that overlooks the distinct requirements of individual samples and fails to balance modality-assisted disambiguation against modality-induced noise. Moreover, the common practice of extracting features with separate large-scale visual and textual pre-trained models neither addresses inter-modal heterogeneity nor mitigates the high computational cost of fine-tuning. To resolve these two issues, we introduce a novel approach named Multimodal Entity Linking with Dynamic Modality Selection and Interactive Prompt Learning (DSMIP). First, we design three expert networks, each tailored to the task using a different subset of modalities, and train them individually. Specifically, for the multimodal expert network, we enhance entity and mention feature extraction by updating multimodal prompts and introducing a coupling function that enables prompt interaction across modalities. Subsequently, to select the best-suited expert network for each sample, we devise a Modality Selection Gating Network that obtains the optimal one-hot selection vector through a specialized reparameterization technique and a two-stage training process. Experimental results on three public benchmark datasets demonstrate that the proposed DSMIP outperforms state-of-the-art baselines.
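Below is a minimal, illustrative PyTorch sketch of the two mechanisms the abstract names: a coupling function that lets text and visual prompts condition each other, and a modality-selection gate that emits a per-sample one-hot expert-selection vector via a differentiable reparameterization. All module names, dimensions, and the choice of the straight-through Gumbel-Softmax as the reparameterization trick are assumptions for exposition; the paper's actual formulation, coupling function, and two-stage training procedure may differ.

```python
# Hedged sketch of prompt coupling and one-hot expert selection.
# Gumbel-Softmax is assumed as the reparameterization; the paper may use another technique.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptCoupling(nn.Module):
    """Hypothetical coupling function: each modality's prompts are updated with a
    linear projection of the other modality's prompts, so the prompt sets interact."""

    def __init__(self, text_dim: int, vis_dim: int):
        super().__init__()
        self.text_to_vis = nn.Linear(text_dim, vis_dim)
        self.vis_to_text = nn.Linear(vis_dim, text_dim)

    def forward(self, text_prompts, vis_prompts):
        coupled_vis = vis_prompts + self.text_to_vis(text_prompts)
        coupled_text = text_prompts + self.vis_to_text(vis_prompts)
        return coupled_text, coupled_vis


class ModalitySelectionGate(nn.Module):
    """Hypothetical gating network: scores the three expert networks for each sample
    and draws a one-hot selection vector with the straight-through Gumbel-Softmax,
    keeping the hard choice differentiable during training."""

    def __init__(self, feat_dim: int, num_experts: int = 3, tau: float = 1.0):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, num_experts)
        self.tau = tau

    def forward(self, sample_feat):
        logits = self.scorer(sample_feat)  # [batch, num_experts]
        # hard=True returns a one-hot vector in the forward pass while
        # gradients flow through the soft relaxation.
        return F.gumbel_softmax(logits, tau=self.tau, hard=True)


if __name__ == "__main__":
    batch = 4
    # Couple assumed 512-d text prompts with assumed 768-d visual prompts.
    coupler = PromptCoupling(text_dim=512, vis_dim=768)
    t_p, v_p = coupler(torch.randn(batch, 8, 512), torch.randn(batch, 8, 768))

    # Select one of three experts per sample from an assumed 512-d sample feature.
    gate = ModalitySelectionGate(feat_dim=512)
    selection = gate(torch.randn(batch, 512))        # one-hot per sample
    expert_scores = torch.randn(batch, 3)            # stand-in expert outputs
    fused = (selection * expert_scores).sum(dim=-1)  # keeps only the chosen expert
    print(selection, fused)
```

The straight-through trick lets the gate commit to a single expert per sample at inference while still receiving gradients during training, which is one common way to realize a hard one-hot selection; whether DSMIP uses this particular estimator is an assumption here.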