Abstract: Multimodal Named Entity Recognition (MNER) aims to extract named entities from text by leveraging both textual and visual modalities. Although existing methods focus on enhancing cross-modal interaction or reducing the interference of irrelevant images, two major challenges remain: (1) the textual content is often short and informal, lacking sufficient context to accurately identify ambiguous or low-frequency entities; (2) fine-grained entity information in images that is relevant to the text is rarely utilized. To address these challenges, we propose TVOMNER, a novel framework that focuses on Textual and Visual feature Optimization for MNER. For textual optimization, the model retrieves external knowledge about candidate entities from Wikipedia and incorporates it into the original text to provide richer semantic context. For visual optimization, it integrates (a) heterogeneous text-guided features via a variational autoencoder (VAE), (b) global visual features generated by a visual encoder, and (c) fine-grained, object-level entity visual features extracted with large language models (LLMs) and visual grounding (VG) models. These features are adaptively fused and then integrated with the textual representation through a cross-modal attention mechanism and a dynamic gating module. Extensive experiments on two widely used datasets show that TVOMNER outperforms all baselines and exhibits robust, competitive performance.
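The abstract describes fusing the three visual feature types with the textual representation through cross-modal attention followed by a dynamic gate. The sketch below illustrates one plausible way to wire such a fusion in PyTorch; the module names, dimensions, and the gated-residual formulation are illustrative assumptions and do not reflect the authors' actual implementation.

```python
import torch
import torch.nn as nn

class CrossModalGatedFusion(nn.Module):
    """Minimal sketch: text tokens attend over a sequence of visual features,
    and a dynamic gate decides, per token, how much visual context to keep."""
    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        # cross-modal attention: text queries, visual keys/values
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # dynamic gating module conditioned on both modalities
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, text_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (B, L_t, d); visual_feats: (B, L_v, d), e.g. the
        # concatenation of VAE-based, global, and object-level visual features
        attended, _ = self.cross_attn(text_feats, visual_feats, visual_feats)
        g = self.gate(torch.cat([text_feats, attended], dim=-1))
        return text_feats + g * attended  # gated residual fusion


if __name__ == "__main__":
    fusion = CrossModalGatedFusion()
    text = torch.randn(2, 32, 768)    # token-level text representations
    visual = torch.randn(2, 10, 768)  # stacked visual feature vectors
    print(fusion(text, visual).shape)  # torch.Size([2, 32, 768])
```

The gated residual keeps the original text representation intact when the gate closes, which matches the stated goal of suppressing irrelevant visual information; this is one common design choice, offered here only as an assumption.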
Paper Type: Long
Research Area: Information Extraction
Research Area Keywords: Information Extraction, Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 2433