Grounding QA Generation in Knowledge Graphs and Literature: A Scalable LLM Framework for Scientific Discovery

Published: 04 Mar 2025, Last Modified: 17 Apr 2025 · ICLR 2025 Workshop SynthData · CC BY 4.0
Keywords: Synthetic Q/A Dataset, Large Language Models, Knowledge Graphs, Biomedical Research
TL;DR: We introduce a framework merging precision medicine knowledge graphs with large language models to create synthetic QA datasets, advancing early-stage biomarker discovery and providing benchmarks for LLM evaluation in biomedical research.
Abstract: Therapeutic biomarkers are crucial in biomedical research and clinical decision-making, yet the field lacks standardized datasets and evaluation methods for complex, context-dependent questions. To address this, we integrate large language models (LLMs) with knowledge graphs (KGs) to filter PubMed abstracts, summarize biomarker contexts, and generate a high-quality synthetic Q/A dataset. Our approach mirrors biomarker scientists' workflows, decomposing question generation into classification, named entity recognition (NER), and summarization. We release a 24k-pair high-quality Q/A dataset and show through ablation studies that incorporating NER and summarization improves performance over using abstracts alone. Evaluating multiple LLMs, we find that while models achieve 96% accuracy on multiple-choice questions, performance drops to 69% on open-ended Q/A, highlighting the need for synthetic data to support novel discovery. By addressing a critical resource gap, this work provides a scalable tool for biomarker research and demonstrates AI's broader potential in scientific discovery.
Submission Number: 58
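The abstract describes a staged pipeline: filter PubMed abstracts for relevance, run NER, summarize the biomarker context, then generate Q/A pairs. The sketch below illustrates that decomposition only; every function name is hypothetical, and simple rule-based stand-ins replace the LLM calls used in the actual framework.

```python
# Hypothetical sketch of the staged QA-generation pipeline described in the
# abstract. Rule-based placeholders stand in for the paper's LLM/KG components.

def classify_relevance(abstract: str) -> bool:
    # Step 1: keep only abstracts discussing therapeutic biomarkers
    # (the framework uses an LLM classifier; a keyword check stands in here).
    return "biomarker" in abstract.lower()

def extract_entities(abstract: str) -> list[str]:
    # Step 2: NER over the abstract (placeholder: capitalized tokens).
    return sorted({tok.strip(".,") for tok in abstract.split()
                   if tok[:1].isupper() and len(tok) > 2})

def summarize_context(abstract: str, entities: list[str]) -> str:
    # Step 3: condense the biomarker context (placeholder: first sentence).
    first = abstract.split(". ")[0]
    return f"{first}. Key entities: {', '.join(entities)}"

def generate_qa(summary: str, entities: list[str]) -> list[dict]:
    # Step 4: turn the grounded summary into synthetic Q/A pairs.
    return [{"question": f"What role does {e} play in this study?",
             "answer": summary} for e in entities]

def pipeline(abstract: str) -> list[dict]:
    # Chain the stages; irrelevant abstracts yield no Q/A pairs.
    if not classify_relevance(abstract):
        return []
    entities = extract_entities(abstract)
    summary = summarize_context(abstract, entities)
    return generate_qa(summary, entities)

sample = ("HER2 is a predictive biomarker for trastuzumab response. "
          "Trials confirm this.")
print(len(pipeline(sample)))
```

Decomposing generation this way is what the ablation studies probe: dropping the NER or summarization stage and generating directly from raw abstracts degrades Q/A quality.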
