Light Up the Shadows: Enhance Long-Tail Entity Grounding with Concept-Guided Vision-Language Models

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission
TL;DR: We draw inspiration from the Triangle of Reference theory and propose to enhance pre-trained vision-language models with concepts.
Abstract: Multi-Modal Knowledge Graphs (MMKGs) are Knowledge Graphs (KGs) that integrate information from multiple modalities and hold significant application value. However, constructing MMKGs often introduces mismatched images, i.e., noise. Because entity images on the internet follow a power-law distribution, a large number of long-tail entities have very few images, and existing methods struggle to accurately identify images of such entities. To address this issue, we draw inspiration from the Triangle of Reference theory and propose to enhance pre-trained vision-language models with concepts. Specifically, we propose a two-stage framework comprising two modules, i.e., Concept Integration and Evidence Fusion. The Concept Integration module accurately recognizes image-text pairs associated with long-tail entities, thereby improving MMKG quality. The Evidence Fusion module provides explainability for the results, which facilitates human verification and further enhances long-tail entity grounding. Finally, we construct a dataset of 25k image-text pairs of long-tail entities. Comprehensive experiments show that our method outperforms the baselines, achieving an average improvement of about 20% in Mean Reciprocal Rank (MRR) on the ranking task and approximately 85% in F1 on the classification task.
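The abstract reports results in Mean Reciprocal Rank (MRR), the standard metric for ranking tasks. As a minimal illustrative sketch (not the paper's code; the function and item names are hypothetical), MRR averages the reciprocal rank of the first correct candidate across queries:

```python
def mean_reciprocal_rank(ranked_lists, gold):
    """MRR = mean over queries of 1 / rank of the first correct item (0 if absent)."""
    total = 0.0
    for candidates, target in zip(ranked_lists, gold):
        rr = 0.0
        for rank, item in enumerate(candidates, start=1):
            if item == target:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_lists)

# Two toy queries: correct image ranked 1st and 2nd -> (1.0 + 0.5) / 2 = 0.75
print(mean_reciprocal_rank([["img_a", "img_b"], ["img_c", "img_a"]],
                           ["img_a", "img_a"]))
```

In the paper's setting, each "query" would be a long-tail entity and the candidates its retrieved images; an average MRR gain of ~20% means correct images move substantially closer to the top of the ranking.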
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: NLP engineering experiment
Languages Studied: English