Abstract: Recent advances in Multimodal Entity Linking (MEL) utilize multimodal information to link target mentions to their corresponding entities. However, existing methods uniformly adopt a “one-size-fits-all” approach, ignoring the needs of individual samples and the noise induced by certain modalities. Moreover, the common practice of extracting features with separate large-scale visual and textual pre-trained models neither addresses inter-modal heterogeneity nor avoids the high computational cost of fine-tuning. To resolve these two issues, this paper introduces a novel approach named Multimodal Entity Linking with Dynamic Modality Selection and Interactive Prompt Learning (DSMIP). First, we design three expert networks, each tackling the task with a different subset of modalities, and train them individually. In particular, for the multimodal expert network, we extract multimodal features of entities and mentions by updating multimodal prompts and design a coupling function to enable prompt interaction across modalities. Subsequently, to select the best-suited expert network for each sample, we devise a Modality Selection Gating Network that obtains the optimal one-hot selection vector through a specialized reparameterization technique and a two-stage training scheme. Experimental results on three public benchmark datasets demonstrate that our solution outperforms the majority of state-of-the-art baselines and surpasses all baselines in low-resource training settings.
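To make the gating mechanism concrete, below is a minimal sketch (not the authors' code) of a Modality Selection Gating Network that scores three expert networks per sample and emits a one-hot selection vector. The abstract only states that "a specialized reparameterization technique" is used; here a Gumbel-Softmax straight-through estimator is assumed as a stand-in, and the expert subsets (text-only, image-only, multimodal), module names, and dimensions are hypothetical.

```python
# Hedged sketch: per-sample expert selection via a gating network with a
# Gumbel-Softmax straight-through estimator (assumed reparameterization).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalitySelectionGate(nn.Module):
    def __init__(self, feat_dim: int, num_experts: int = 3):
        super().__init__()
        # Scores how well each expert (e.g., text-only, image-only, multimodal)
        # suits the current sample; the expert inventory is an assumption here.
        self.scorer = nn.Linear(feat_dim, num_experts)

    def forward(self, sample_feat: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        logits = self.scorer(sample_feat)                 # (batch, num_experts)
        # hard=True yields a one-hot vector in the forward pass while keeping
        # soft gradients in the backward pass (straight-through).
        return F.gumbel_softmax(logits, tau=tau, hard=True)

# Usage: gate the per-expert entity scores with the one-hot selection vector.
gate = ModalitySelectionGate(feat_dim=768)
feats = torch.randn(4, 768)                               # mention features (hypothetical)
expert_scores = torch.randn(4, 3, 100)                    # (batch, experts, candidate entities)
selection = gate(feats)                                   # (batch, 3), one-hot rows
fused = (selection.unsqueeze(-1) * expert_scores).sum(1)  # keep only the selected expert's scores
```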
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches for low compute settings-efficiency, Data analysis
Languages Studied: English