VP-MEL: Visual Prompts Guided Multimodal Entity Linking

ACL ARR 2024 December Submission 183 Authors

10 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract:

Multimodal entity linking (MEL), the task of linking mentions within multimodal contexts to their corresponding entities in a knowledge base (KB), has attracted considerable attention in recent years due to its wide range of applications. However, existing MEL methods often rely heavily on mention words as retrieval cues, which limits their ability to effectively exploit information from both images and text. This reliance poses significant challenges in scenarios where mention words are absent, as current MEL approaches struggle to leverage image-text pairs for accurate entity linking. To address these issues, we introduce the Visual Prompts guided Multimodal Entity Linking (VP-MEL) task. Given a text-image pair, VP-MEL aims to link a marked region (i.e., a visual prompt) in the image to its corresponding entity in the knowledge base. To facilitate this task, we present a new dataset, VPWiki, specifically designed for VP-MEL. Furthermore, we propose a framework named FBMEL, which enhances visual feature extraction using visual prompts and leverages the pretrained Detective-VLM model to capture latent information. Experimental results on the VPWiki dataset demonstrate that FBMEL outperforms baseline methods across multiple benchmarks for the VP-MEL task. Code and datasets will be released at https://anonymous.4open.science/r/VP-MEL-26E2.
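To make the task setup concrete, below is a minimal illustrative sketch of a VP-MEL instance and a candidate-ranking step. The field names, the bounding-box encoding of the visual prompt, and the cosine-similarity scorer are hypothetical stand-ins for illustration only; they are not the authors' FBMEL implementation or the VPWiki data schema.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, List, Tuple


@dataclass
class VPMELExample:
    """One hypothetical VP-MEL instance: an image-text pair plus a visual prompt.

    The visual prompt marks the image region to be linked; unlike standard MEL,
    there is no textual mention word to use as a retrieval cue.
    """
    image_path: str
    text: str
    prompt_box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) marking the region
    gold_entity_id: str                    # KB identifier of the correct entity


def cosine(u: List[float], v: List[float]) -> float:
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u)) or 1.0
    nv = sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)


def link_entity(region_embedding: List[float],
                candidates: Dict[str, List[float]]) -> str:
    """Return the KB entity whose embedding is most similar to the prompted region.

    `region_embedding` stands in for whatever features a model extracts from the
    visually prompted region together with its textual context.
    """
    return max(candidates,
               key=lambda eid: cosine(region_embedding, candidates[eid]))


if __name__ == "__main__":
    example = VPMELExample(
        image_path="images/0001.jpg",
        text="The striker celebrates after scoring the winning goal.",
        prompt_box=(120, 40, 310, 400),
        gold_entity_id="Q615",
    )
    # Toy 3-d embeddings for the prompted region and two candidate KB entities.
    region = [0.9, 0.1, 0.2]
    kb = {"Q615": [0.85, 0.05, 0.25], "Q7378": [0.1, 0.9, 0.3]}
    predicted = link_entity(region, kb)
    print(predicted == example.gold_entity_id)  # True for this toy input
```

In this sketch, evaluation reduces to checking whether the top-ranked KB entity matches the gold identifier for each prompted region; the actual FBMEL scoring and feature extraction are described in the paper itself.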

Paper Type: Long
Research Area: Information Retrieval and Text Mining
Research Area Keywords: entity linking/disambiguation; cross-modal information extraction
Languages Studied: English
Submission Number: 183