Distill-SynthKG: Distilling Knowledge Graph Synthesis Workflow for Improved Coverage and Efficiency

ACL ARR 2025 February Submission 5428 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Knowledge graphs (KGs) generated by large language models (LLMs) are becoming increasingly valuable for Retrieval-Augmented Generation (RAG). However, existing KG extraction methods predominantly rely on prompt-based approaches, which are inefficient for processing large-scale corpora and suffer from information loss with long documents. Additionally, methods and datasets for evaluating ontology-free KG construction are lacking. To address these shortcomings, we propose SynthKG, a multi-step, document-level ontology-free KG synthesis workflow. By further fine-tuning a smaller LLM on synthesized document-KG pairs, we streamline the multi-step process into a single-step KG generation approach called Distill-SynthKG. Furthermore, we re-purpose existing question-answering datasets to establish KG evaluation datasets and introduce new evaluation metrics. Using KGs produced by Distill-SynthKG, we also design a novel graph-based retrieval framework for RAG. Experimental results demonstrate that Distill-SynthKG not only surpasses all baseline models in KG quality (including models up to eight times larger) but also consistently excels in retrieval and question-answering tasks. Additionally, our proposed graph retrieval framework outperforms all KG-retrieval methods across multiple benchmark datasets. We make SynthKG and Distill-SynthKG publicly available.
Paper Type: Long
Research Area: Information Retrieval and Text Mining
Research Area Keywords: Knowledge Graph, RAG, Synthetic Data Generation, Knowledge Distillation
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 5428