KE-UMNER: Knowledge-Enriched Urdu Multimodal Named Entity Recognition Using LLM and Vision-Language Integration

ACL ARR 2025 May Submission 1608 Authors

18 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Multimodal Named Entity Recognition (MNER) identifies entities of predefined categories in text by drawing on multiple modalities, primarily the text itself and accompanying images. While the task has seen progress in high-resource languages, it remains challenging in low-resource settings such as Urdu, where social media content is often short, informal, and ambiguous. To address this, we propose KE-UMNER, a knowledge-enriched MNER framework that augments the multimodal input with external semantic knowledge. It leverages Large Language Models (LLMs) to generate entity-specific contextual knowledge and employs a vision-language model (BLIP) to produce natural-language captions from images. These knowledge signals are integrated with the input through a cross-modal attention mechanism and decoded by a BiLSTM-CRF layer for sequence labeling. Experiments on the Twitter2015-Urdu dataset show that KE-UMNER achieves a 12.08% absolute improvement in F1-score over prior state-of-the-art models. Ablation studies confirm the contribution of the external knowledge sources, and case analyses demonstrate improved entity disambiguation in noisy, low-resource contexts.
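
The pipeline described in the abstract (BLIP captioning, cross-modal attention over knowledge signals, BiLSTM-CRF decoding) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes Hugging Face's `transformers` BLIP checkpoint and the `pytorch-crf` package, and the `KEUMNERHead` module, its dimensions, and the tag count are hypothetical placeholders.

```python
# Minimal sketch of a KE-UMNER-style pipeline. Assumptions (not from the
# paper): Hugging Face `transformers` for BLIP, `pytorch-crf` for the CRF,
# and hypothetical module names and dimensions.
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf
from transformers import BlipProcessor, BlipForConditionalGeneration

def caption_image(image):
    """Produce a natural-language caption from an image with BLIP."""
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
    inputs = processor(images=image, return_tensors="pt")
    out = blip.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

class KEUMNERHead(nn.Module):
    """Hypothetical decoder head: cross-modal attention fuses token embeddings
    with embedded knowledge signals (LLM knowledge + BLIP caption), then a
    BiLSTM-CRF labels the token sequence."""
    def __init__(self, dim=768, num_tags=9):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.bilstm = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.emit = nn.Linear(dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, text_emb, knowledge_emb, tags=None):
        # Text tokens attend over the external-knowledge embeddings.
        fused, _ = self.cross_attn(text_emb, knowledge_emb, knowledge_emb)
        hidden, _ = self.bilstm(fused)
        emissions = self.emit(hidden)
        if tags is not None:                # training: negative log-likelihood
            return -self.crf(emissions, tags)
        return self.crf.decode(emissions)   # inference: best tag sequences
```

As a usage sketch, `text_emb` and `knowledge_emb` would be `(batch, seq_len, 768)` tensors from a pretrained encoder, with the caption from `caption_image` encoded alongside the LLM-generated knowledge before fusion.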
Paper Type: Long
Research Area: Information Extraction
Research Area Keywords: named entity recognition, multilingual extraction, knowledge-enriched NER, low-resource NLP, multimodal NER
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English, Urdu
Submission Number: 1608