Beyond Images - Are a Thousand Words Better Than a Single Picture? A Framework for Multi-Modal Knowledge Graph Dataset Enrichment

ACL ARR 2025 February Submission 1463 Authors

13 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Multi-Modal Knowledge Graphs (MMKGs) enhance entity representations by incorporating text, images, audio, and video, offering a more comprehensive understanding of each entity. Among these modalities, images are especially valuable due to their rich content and the ease of large-scale collection. However, many images are semantically ambiguous, making it challenging for models to use them effectively to enhance entity representations. To address this, we present the Beyond Images framework, which generates textual descriptions for entity images to more effectively capture their semantic relevance to the associated entity. By adding these textual descriptions, we achieve up to a 5% improvement in Hits@1 on the link prediction task across three MMKG datasets. Furthermore, our scalable framework reduces the need for manual construction by automatically extending three MMKG datasets with additional images and their descriptions. Our work highlights the importance of textual descriptions for MMKGs. Our code and enriched datasets are publicly available at https://anonymous.4open.science/r/Beyond-Images-2266
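The core enrichment step described in the abstract, generating an entity-aware textual description for each image, can be sketched with an off-the-shelf vision-language captioner. The snippet below is a minimal illustration only, not the authors' actual pipeline: the BLIP checkpoint, the entity-conditioned prompt, the `describe_entity_image` helper, and the `entity_image.jpg` path are all assumptions made for demonstration.

```python
# Minimal sketch: caption an entity image with an off-the-shelf
# vision-language model (BLIP). Illustrative only; the paper's actual
# model choice and prompting strategy may differ.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def describe_entity_image(image_path: str, entity_name: str) -> str:
    """Generate a textual description of an image, conditioned on the
    entity it depicts so the caption stays semantically relevant."""
    image = Image.open(image_path).convert("RGB")
    # BLIP supports conditional captioning: the text is used as a prefix
    # that the generated caption continues.
    inputs = processor(images=image, text=f"a photo of {entity_name},", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Hypothetical usage: the resulting description would be attached to the
# entity's image in the MMKG before training a link prediction model.
print(describe_entity_image("entity_image.jpg", "Eiffel Tower"))
```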
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: knowledge base construction, cross-modal information extraction, multimodality, data augmentation, cross-modal content generation
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 1463