From Graphs to Questions: A Framework for Complex Biomedical KGQA Dataset Generation

ACL ARR 2025 May Submission2757 Authors

19 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: This work introduces BioGraphletQA, a novel large-scale dataset for complex biomedical Knowledge Graph Question Answering (KGQA), and describes the underlying generation framework. Central to our approach is the use of graphlets (small subgraphs extracted from a KG) as anchors for generating diverse and complex QA pairs using large language models (LLMs). Our pipeline comprises three stages: (1) KG preprocessing and reduction to produce a manageable subset; (2) an extensive prompt ablation study to identify the optimal prompt for QA generation; and (3) a filtering phase in which an LLM refines the dataset by removing low-quality pairs. The final dataset comprises 119,856 complex QA pairs, each linked to a graphlet containing up to five nodes. To assess quality, a domain expert annotated 53 QA pairs across five criteria, confirming the scientific validity, complexity, and completeness of the data. All code is available at: https://anonymous.4open.science/r/Synthetic-KGQA-CE2F.
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: Knowledge Graphs, Biomedical Question Answering, Synthetic Data, Large Language Models
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 2757
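
Illustrative sketch: the abstract describes graphlets as small connected subgraphs of a KG with up to five nodes. The Python snippet below is a minimal sketch of that idea only, not the authors' extraction procedure. The toy triples, the function names (`build_adjacency`, `sample_graphlet`, `is_connected`), and the random frontier-expansion strategy are all assumptions made for illustration.

```python
import random

# Toy KG as (head, relation, tail) triples. Entities and relations are
# invented for illustration and are not taken from the paper's KG.
TRIPLES = [
    ("aspirin", "inhibits", "PTGS2"),
    ("PTGS2", "participates_in", "inflammation"),
    ("aspirin", "treats", "pain"),
    ("inflammation", "associated_with", "arthritis"),
    ("ibuprofen", "inhibits", "PTGS2"),
    ("ibuprofen", "treats", "arthritis"),
]


def build_adjacency(triples):
    """Undirected adjacency map: node -> set of neighbouring nodes."""
    adj = {}
    for head, _, tail in triples:
        adj.setdefault(head, set()).add(tail)
        adj.setdefault(tail, set()).add(head)
    return adj


def sample_graphlet(triples, seed_node, max_nodes=5, rng=None):
    """Grow a connected node set from seed_node, one frontier node at a time.

    Assumed strategy (hypothetical): pick a random neighbour of the current
    node set until max_nodes is reached or no neighbour is left.
    """
    rng = rng or random.Random(0)
    adj = build_adjacency(triples)
    nodes = {seed_node}
    while len(nodes) < max_nodes:
        frontier = sorted({n for v in nodes for n in adj.get(v, ())} - nodes)
        if not frontier:
            break
        nodes.add(rng.choice(frontier))
    # Keep every triple whose endpoints both lie inside the graphlet.
    edges = [(h, r, t) for h, r, t in triples if h in nodes and t in nodes]
    return nodes, edges


def is_connected(nodes, edges):
    """Check that the graphlet's own edges connect all of its nodes."""
    adj = build_adjacency(edges)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for neighbour in adj.get(stack.pop(), ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen == set(nodes)


if __name__ == "__main__":
    graphlet_nodes, graphlet_edges = sample_graphlet(TRIPLES, "aspirin")
    for head, relation, tail in graphlet_edges:
        print(f"{head} --{relation}--> {tail}")
```

In a pipeline of the kind the abstract outlines, the triples of such a graphlet would presumably be serialised into the prompt given to the LLM that writes the QA pair. The prompt format itself is not specified in the abstract.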