Abstract: Multimodal Named Entity Recognition (MNER) aims to extract named entities from text by leveraging both textual and visual modalities. Although existing methods focus on enhancing cross-modal interaction or reducing the interference of irrelevant images, two major challenges remain: (1) the textual content is often short and informal, lacking sufficient context to accurately identify ambiguous or low-frequency entities; (2) fine-grained entity information in images that is relevant to the text is rarely utilized. To address these challenges, we propose TVOMNER, a novel framework that focuses on Textual and Visual feature Optimization for MNER. For textual optimization, the model retrieves external knowledge about candidate entities from Wikipedia and incorporates it into the original text to provide richer semantic context. For visual optimization, it integrates (a) heterogeneous text-guided features via a variational autoencoder (VAE), (b) global visual features generated by a visual encoder, and (c) fine-grained, object-level entity visual features extracted with large language models (LLMs) and visual grounding (VG) models. These features are adaptively fused and then integrated with the textual representation through a cross-modal attention mechanism and a dynamic gating module. Extensive experiments on two widely used datasets show that TVOMNER outperforms all baselines and exhibits robust, competitive performance.
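The abstract describes fusing the three visual feature types with the textual representation through cross-modal attention followed by a dynamic gate. The sketch below illustrates one plausible way to wire such a fusion in PyTorch; the module names, dimensions, and the gated-residual formulation are illustrative assumptions and do not reflect the authors' actual implementation.

```python
import torch
import torch.nn as nn

class CrossModalGatedFusion(nn.Module):
    """Minimal sketch: text tokens attend over a sequence of visual features,
    and a dynamic gate decides, per token, how much visual context to keep."""
    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        # cross-modal attention: text queries, visual keys/values
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # dynamic gating module conditioned on both modalities
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, text_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (B, L_t, d); visual_feats: (B, L_v, d), e.g. the
        # concatenation of VAE-based, global, and object-level visual features
        attended, _ = self.cross_attn(text_feats, visual_feats, visual_feats)
        g = self.gate(torch.cat([text_feats, attended], dim=-1))
        return text_feats + g * attended  # gated residual fusion


if __name__ == "__main__":
    fusion = CrossModalGatedFusion()
    text = torch.randn(2, 32, 768)    # token-level text representations
    visual = torch.randn(2, 10, 768)  # stacked visual feature vectors
    print(fusion(text, visual).shape)  # torch.Size([2, 32, 768])
```

The gated residual keeps the original text representation intact when the gate closes, which matches the stated goal of suppressing irrelevant visual information; this is one common design choice, offered here only as an assumption.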
Paper Type: Long
Research Area: Information Extraction
Research Area Keywords: Information Extraction, Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 2433