ThinkLinker: From Low-Rank Interaction to Knowledge-Aware Verification for Multimodal Entity Linking
Keywords: Multimodal Entity Linking, Multimodal, Large Language Model
Abstract: Recent advances in Multimodal Entity Linking (MEL) exploit textual and visual information to disambiguate mentions and align them with entities in a knowledge base. Existing methods typically design separate and complex network modules for each type of interaction among multi-granular and multimodal features, while lacking explicit modeling of the joint dependencies among these features. Moreover, most approaches rely on unidirectional retrieval-based matching and lack knowledge-driven verification, leading to unreliable disambiguation in weak-context scenarios. To address these challenges, we propose a novel two-stage MEL framework termed ThinkLinker. First, we introduce a low-rank fusion mechanism to model the joint dependencies among multi-granular and multimodal features, enabling comprehensive and explicit interactions while learning task-relevant discriminative information for candidate ranking in a lower-dimensional space. Subsequently, we develop a bidirectional retrieval-verification paradigm, where the ranked candidate entities guide an LLM-based multi-turn, dialogue-style verification process to generate mention-specific contextual augmentation. The augmented context is then adaptively fused with the original representation to further refine the linking model. Experimental results on public benchmark datasets demonstrate that the proposed ThinkLinker outperforms all state-of-the-art baselines. The code is publicly available at https://anonymous.4open.science/r/ThinkLinker-D443.
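The abstract does not spell out the exact form of the low-rank fusion mechanism; a minimal sketch of one common formulation it may resemble (low-rank factorization of a multimodal tensor product, as in Low-rank Multimodal Fusion) is shown below. The function name `low_rank_fusion`, the factor shapes, and the two-modality setup are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def low_rank_fusion(feats, factors):
    """Fuse modality feature vectors via low-rank factors.

    feats:   list of 1-D modality vectors (e.g. textual and visual features).
    factors: list of arrays, one per modality, each of shape
             (rank, d_m + 1, d_out); the +1 row absorbs a bias term so that
             unimodal information survives the multiplicative fusion.
    Returns a fused vector of shape (d_out,).

    This is an illustrative sketch, not the paper's implementation: the
    rank-r factors approximate a full tensor-product interaction across
    modalities at a fraction of its parameter cost.
    """
    fused = None
    for z, W in zip(feats, factors):
        z1 = np.append(z, 1.0)                 # append bias 1 to the modality vector
        proj = np.einsum('i,rio->ro', z1, W)   # project into (rank, d_out)
        # element-wise product across modalities mimics the tensor product
        fused = proj if fused is None else fused * proj
    return fused.sum(axis=0)                   # sum over the rank dimension

# Illustrative usage with random features (dimensions are arbitrary):
rng = np.random.default_rng(0)
text_feat = rng.standard_normal(4)             # hypothetical textual feature
image_feat = rng.standard_normal(5)            # hypothetical visual feature
W_text = rng.standard_normal((3, 5, 6))        # rank 3, input 4+1, output 6
W_image = rng.standard_normal((3, 6, 6))       # rank 3, input 5+1, output 6
fused = low_rank_fusion([text_feat, image_feat], [W_text, W_image])
```

The key design point is that the joint dependency across modalities is captured multiplicatively in a rank-r, lower-dimensional space, rather than by concatenating features or building a separate interaction module per feature pair.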
Paper Type: Long
Research Area: Information Extraction and Retrieval
Research Area Keywords: Information Extraction; Multimodality and Language Grounding to Vision, Robotics and Beyond; Machine Learning for NLP
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 3664