Taxonomy-Driven Knowledge Graph Construction for Domain-Specific Scientific Applications

ACL ARR 2025 February Submission2512 Authors

14 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: We present a taxonomy-driven framework for constructing domain-specific knowledge graphs (KGs) that integrates structured taxonomies, Large Language Models (LLMs), and Retrieval-Augmented Generation (RAG). A key challenge in LLM-based extraction is weak annotation: noisy or misaligned entity and relationship labels that diverge from expert-curated taxonomies. For instance, the state-of-the-art generalist GLiNER model achieves only 0.339 F1 on climate science entity recognition, often omitting critical concepts or hallucinating entities. Our approach addresses these issues by anchoring the extraction process to verified taxonomies, enforcing entity constraints during LLM prompting, and validating outputs via RAG. Through a climate science case study using our annotated dataset of 25 publications (1,705 entity links, 3,618 relationships), we demonstrate that taxonomy-guided LLM prompting combined with RAG-based validation reduces hallucinations by 23.3% while improving F1 scores by 13.9% compared to baselines without the proposed techniques. Our contributions include: 1) a generalizable methodology for taxonomy-aligned KG construction; 2) a reproducible annotation pipeline; 3) the first benchmark dataset for climate science information retrieval; and 4) empirical insights into combining structured taxonomies with LLMs for specialized domains. Code and data will be released upon acceptance.
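The idea of anchoring extraction to a verified taxonomy and scoring the result with F1 can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the example taxonomy terms, and the candidate entities are all hypothetical, and a simple normalized exact-match filter stands in for the LLM prompting and RAG validation steps described in the abstract.

```python
def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so surface variants compare equal."""
    return " ".join(label.lower().split())


def filter_to_taxonomy(candidates, taxonomy):
    """Split candidate entities into those found in the taxonomy and the rest.

    Assumption: exact match after normalization. A real system would likely
    use fuzzy or embedding-based matching instead.
    """
    allowed = {normalize(term) for term in taxonomy}
    kept, rejected = [], []
    for candidate in candidates:
        if normalize(candidate) in allowed:
            kept.append(candidate)
        else:
            rejected.append(candidate)
    return kept, rejected


def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 over normalized entity labels."""
    pred = {normalize(p) for p in predicted}
    ref = {normalize(g) for g in gold}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical data for illustration only.
taxonomy = ["sea surface temperature", "CMIP6", "aerosol optical depth"]
gold = ["sea surface temperature", "CMIP6", "aerosol optical depth"]
candidates = ["Sea Surface Temperature", "CMIP6", "ocean mood index"]

kept, rejected = filter_to_taxonomy(candidates, taxonomy)
unfiltered_scores = precision_recall_f1(candidates, gold)
filtered_scores = precision_recall_f1(kept, gold)
```

In this toy case the out-of-taxonomy candidate is rejected, which raises precision while recall stays the same; the filter cannot recover concepts the extractor omitted.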
Paper Type: Long
Research Area: Information Extraction
Research Area Keywords: Information Extraction, Language Modeling, Machine Learning for NLP, Resources and Evaluation, NLP Applications
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 2512