KGGen: Extracting Knowledge Graphs from Plain Text with Language Models

Belinda Mo; Kyssen Yu; Joshua Kazdan; Proud Mpala; Lisa Yu; Charilaos I. Kanatsoulis; Sanmi Koyejo

Published: 18 Sept 2025, Last Modified: 29 Oct 2025

Venue: NeurIPS 2025 poster

License: CC BY 4.0

Keywords: knowledge graph, KG, synthetic data, data generation

TL;DR: KGGen is a text-to-knowledge-graph generator that uses language models to extract high-quality graphs from plain text.

Abstract: Recent interest in building foundation models for knowledge graphs has highlighted a fundamental challenge: knowledge graph data is scarce. The best-known knowledge graphs are primarily human-labeled, created by pattern-matching, or extracted using early NLP techniques. While human-generated knowledge graphs are in short supply, automatically extracted ones are of questionable quality. We present KGGen, a novel text-to-knowledge-graph generator that uses language models to extract high-quality graphs from plain text. Unlike other KG generators, KGGen applies a novel entity resolution approach that clusters and de-duplicates related entities, significantly reducing the sparsity problem that plagues existing extractors. Along with KGGen, we release Measure of Information in Nodes and Edges (MINE), the first benchmark to test an extractor’s ability to produce a useful KG from plain text. We benchmark our new tool against leading existing generators such as Microsoft’s GraphRAG; we achieve comparable retrieval accuracy on the generated graphs and better information retention. Moreover, our graphs exhibit more concise and generalizable entities and relations. Our code is open-sourced at https://github.com/stair-lab/kg-gen/.

Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)

Submission Number: 25968
