Few-Shot Joint Multimodal Entity-Relation Extraction via Knowledge-Enhanced Cross-modal Prompt Model

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Joint Multimodal Entity-Relation Extraction (JMERE) is a challenging task that aims to extract entities and their relations from text-image pairs in social media posts. Existing methods for JMERE require large amounts of labeled data, yet gathering and annotating fine-grained multimodal data for this task is difficult. We first construct diverse and comprehensive multimodal few-shot datasets tailored to the underlying data distribution. To address the information scarcity of the few-shot setting, we then introduce the Knowledge-Enhanced Cross-modal Prompt Model (KECPM) for JMERE, which guides a large language model to generate supplementary background knowledge. Our method comprises two stages: (1) a knowledge ingestion stage that dynamically formulates prompts based on semantic similarity and employs self-reflection to refine the knowledge generated by ChatGPT; (2) a knowledge-enhanced language model stage that merges the auxiliary knowledge with the original input and uses a transformer-based model to perform JMERE. We extensively evaluate our approach on a few-shot dataset derived from the JMERE dataset, demonstrating its superiority over strong baselines in terms of both micro and macro F$_1$ scores. Additionally, we present qualitative analyses and case studies to elucidate the effectiveness of our model.
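As a rough illustration of the first (knowledge ingestion) stage, the sketch below retrieves semantically similar exemplars to build a dynamic prompt, queries an LLM for background knowledge, and refines the output through a self-reflection loop. All identifiers, prompt wording, and model names (e.g., `all-MiniLM-L6-v2`, `gpt-3.5-turbo`) are illustrative assumptions, not the paper's exact implementation, and the image modality is omitted for brevity.

```python
# Hypothetical sketch of the knowledge-ingestion stage: dynamic prompt
# construction via semantic similarity, followed by self-reflection.
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def build_dynamic_prompt(query_text, exemplars, k=3):
    """Pick the k exemplars most similar to the query and format a prompt."""
    k = min(k, len(exemplars))
    query_emb = encoder.encode(query_text, convert_to_tensor=True)
    exemplar_embs = encoder.encode(
        [e["text"] for e in exemplars], convert_to_tensor=True
    )
    top = util.cos_sim(query_emb, exemplar_embs)[0].topk(k).indices.tolist()
    demos = "\n".join(
        f"Text: {exemplars[i]['text']}\nKnowledge: {exemplars[i]['knowledge']}"
        for i in top
    )
    return (f"{demos}\n\nText: {query_text}\n"
            "Generate background knowledge about the entities and their relations:")

def ask(prompt):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def generate_knowledge(query_text, exemplars, max_rounds=2):
    """Generate knowledge, then ask the LLM to critique and refine its own output."""
    knowledge = ask(build_dynamic_prompt(query_text, exemplars))
    for _ in range(max_rounds):  # self-reflection loop
        knowledge = ask(
            f"Text: {query_text}\nCandidate knowledge: {knowledge}\n"
            "Revise this knowledge so it is factual and relevant to the text. "
            "Return only the revised knowledge."
        )
    return knowledge
```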
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Generation] Social Aspects of Generative AI
Relevance To Conference: We focus on joint multimodal entity-relation extraction (JMERE) in the few-shot scenario. To the best of our knowledge, this is the first attempt to investigate few-shot JMERE (FS-JMERE) with very limited labeled data. The construction of our few-shot dataset accounts for the distribution of relation categories. We propose a knowledge-enhanced cross-modal prompt model for this setting, utilizing dynamic prompts and knowledge reflection to obtain background knowledge from LLMs and improve performance in low-shot JMERE scenarios. Our approach enables the creation of contextually appropriate prompts, guiding LLMs to generate refined auxiliary knowledge efficiently. Additionally, through knowledge reflection, we enhance relevance by iteratively selecting the knowledge generated by LLMs. We perform comprehensive experiments on the constructed few-shot datasets, and our results conclusively show that our proposed model outperforms strong baselines on JMERE under the few-shot setting (a stage-2 sketch follows below).
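The second (knowledge-enhanced language model) stage can be pictured as fusing the refined knowledge with the original post and decoding entity-relation tuples with a transformer. The sketch below assumes a simple text concatenation and a T5 encoder-decoder; the actual KECPM architecture, its use of image features, and the output format may differ.

```python
# Minimal sketch of the knowledge-enhanced extraction stage, assuming the
# auxiliary knowledge is concatenated with the post text and a seq2seq
# transformer emits entity-relation tuples as a string. The model choice
# (t5-base) and the linearized output format are assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

def extract(text, knowledge, max_new_tokens=128):
    """Fuse the input text with auxiliary knowledge and decode tuples."""
    fused = f"extract entities and relations: {text} context: {knowledge}"
    inputs = tokenizer(fused, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```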
Supplementary Material: zip
Submission Number: 1638