Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction

ACL ARR 2024 June Submission 2605 Authors

15 Jun 2024 (modified: 14 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: In this work, we are interested in automated methods for knowledge graph construction (KGC) from input text. Progress on large language models (LLMs) has prompted a series of recent works applying them to KGC, e.g., via zero/few-shot prompting. Despite successes on small, domain-specific datasets, these models face difficulties scaling up to text common in many real-world applications. A principal issue is that in prior methods, the KG schema has to be included in the LLM prompt to generate valid triplets; larger and more complex schemas easily exceed the LLMs' context window length. Furthermore, there are scenarios where a fixed pre-defined schema is not available and we would like the method to construct an intrinsically high-quality KG with accurate information and a succinct self-generated schema. To address these problems, we propose a three-phase framework named Extract-Define-Canonicalize (EDC): open information Extraction (OIE), followed by schema Definition and post-hoc Canonicalization. EDC is flexible in that it can be applied both to settings where a target schema is available and to settings where it is not; in the latter case, it constructs a schema automatically and applies self-canonicalization. We demonstrate on three KGC benchmarks that EDC is able to extract high-quality triplets without any parameter tuning and with significantly larger schemas compared to prior works. To improve performance, we further introduce a trained component that retrieves schema elements relevant to the input text; this improves the LLMs' extraction performance in a retrieval-augmented-generation-like manner.
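The three phases named in the abstract can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the `llm` argument stands in for any LLM completion function, and a simple string-similarity match stands in for whatever relation-matching the paper actually uses.

```python
# Schematic sketch of the Extract-Define-Canonicalize (EDC) pipeline.
# All function names and signatures here are hypothetical illustrations.
from difflib import get_close_matches


def extract(llm, text):
    # Phase 1 (Extract): open information extraction of raw triplets.
    return llm(f"List (subject, relation, object) triplets found in: {text}")


def define(llm, triplets):
    # Phase 2 (Define): a natural-language definition for each extracted
    # relation, to support matching relations against a schema later.
    return {rel: llm(f"Define the relation '{rel}' in one sentence.")
            for _, rel, _ in triplets}


def canonicalize(triplets, schema=None):
    # Phase 3 (Canonicalize): map each relation onto the target schema when
    # one is given; otherwise grow a schema from the relations seen so far
    # (self-canonicalization). String similarity is used here purely as a
    # placeholder for a more capable matcher.
    canon_schema = list(schema) if schema is not None else []
    out = []
    for subj, rel, obj in triplets:
        match = get_close_matches(rel, canon_schema, n=1, cutoff=0.6)
        if match:
            rel = match[0]                 # canonical relation name
        elif schema is None:
            canon_schema.append(rel)       # self-generated schema grows
        out.append((subj, rel, obj))
    return out, canon_schema
```

For example, with a target schema containing `birthPlace`, a raw triplet `("Alan Turing", "birth place", "London")` would be mapped to the canonical relation; with no schema given, unseen relations are added to the self-generated one.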
Paper Type: Long
Research Area: Information Extraction
Research Area Keywords: knowledge base construction, zero/few-shot extraction, named entity recognition and relation extraction
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 2605