KRAQ: Optimizing Retrieval-Augmented Generation with Knowledge Graph-Based Questions

ACL ARR 2025 July Submission100 Authors

22 Jul 2025 (modified: 20 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Retrieval-Augmented Generation (RAG) systems face significant challenges in retrieval diversity and inference latency, limiting their effectiveness in practical scenarios. We introduce $\textit{KRAQ}$, an approach that employs corpus-derived knowledge graphs to generate high-quality representative questions. These precomputed questions enhance retrieval diversity by serving as alternative retrieval queries, and reduce inference latency by enabling offline pre-computation of embeddings. Implemented within two practical RAG variants, $\textit{Combined Retrieve RAG}$ and $\textit{Efficient Speculative RAG}$, KRAQ outperforms competitive baselines by up to 48.7 points, achieves accuracy gains of up to 3\%, and reduces inference latency by as much as 11.8\%. Our results demonstrate KRAQ's potential as a scalable, robust optimization for improving the performance of RAG systems.
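For concreteness, a minimal sketch of the retrieval idea the abstract describes (not the authors' implementation): representative questions are embedded offline, and at query time the user query is matched against those precomputed embeddings to select the passages linked to the nearest questions. The encoder, the sample questions, and the question-to-passage mapping below are all illustrative assumptions.

```python
# Minimal sketch of the KRAQ retrieval idea (not the authors' code):
# embed KG-derived representative questions offline, then at query time
# retrieve passages via nearest-neighbor search over those embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer

# Generic stand-in encoder; the paper uses nomic-embed-text / InBedder-RoBERTa.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Offline step: precompute embeddings for representative questions, each
# linked to the corpus passage(s) it was generated from (mapping is invented).
questions = ["What causes X?", "How does Y relate to Z?"]
passages = ["Passage about X ...", "Passage about Y and Z ..."]
q_emb = model.encode(questions, normalize_embeddings=True)

def retrieve(user_query: str, k: int = 1) -> list[str]:
    """Match the user query against the precomputed question embeddings
    and return the passages linked to the top-k nearest questions."""
    v = model.encode([user_query], normalize_embeddings=True)[0]
    scores = q_emb @ v  # cosine similarity (embeddings are normalized)
    top = np.argsort(-scores)[:k]
    return [passages[i] for i in top]

print(retrieve("What is the cause of X?"))
```

Because the question embeddings are computed once offline, only the single user query needs embedding at inference time, which is the source of the latency savings claimed above.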
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: Retrieval-Augmented Generation, Knowledge Graphs, Question Generation, Question Answering, Information Retrieval
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings (efficiency), Publicly available software and/or pre-trained models, Theory
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: We discuss potential risks in the "Ethical Considerations" section. This section addresses three key areas: 1) Bias Amplification, acknowledging that KRAQ can reflect and amplify biases present in the source corpus; 2) Potential for Misuse, such as applying the framework to generate misleading questions from disinformation corpora; and 3) Data Privacy risks when the method is used on sensitive datasets.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Yes, we cite the creators of all artifacts used. Section 4.1 ("Datasets") cites the papers for the four evaluation benchmarks. Section 4.4 ("Implementation Details") cites the sources for the language model (LLaMA 3.1-8B-Instruct), the embedding models (nomic-embed-text, InBedder-RoBERTa), and the key frameworks and tools (vLLM, GraphRAG, Qdrant). Appendix A.1 cites the sources for the datasets used in fine-tuning (Dolly-v2, MuSiQue).
B2 Discuss The License For Artifacts: No
B2 Elaboration: For brevity, we did not discuss the specific licenses of the artifacts in the paper. However, all tools, models, and datasets used are publicly available under licenses that permit academic research. The new artifact we created (the fine-tuned KRAQ generator model) is mentioned in a footnote on page 2, which states our intention to release it upon publication under a permissive open-source license.
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Yes, our use of all existing artifacts is fully consistent with their intended purpose. As described in Section 4.1, we use standard QA benchmarks for their designed task of evaluating question-answering systems. As detailed in Section 4.4, language models and embedding models are used for their respective intended functions of generation and semantic representation. Our created artifact, the fine-tuned KRAQ generator, is used within our research context, and we state our intention to release it for future research use in the footnote on page 3. Regarding GraphRAG, our use was also consistent with its intended function: we employed the framework to build a knowledge graph and generate community summaries. While the GraphRAG authors primarily use these summaries for direct question answering, we repurposed them for a novel downstream task—as input for our question generator. This constitutes a new application of its outputs, not a deviation from the framework's intended use.
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: We did not discuss specific checks for PII or offensive content because our work relies exclusively on well-established, public academic datasets (TriviaQA, HotpotQA, etc.) that are standard benchmarks in the NLP community and are presumed to have been curated for public research.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Yes, documentation is provided in Section 4.1, where we specify the domain for each dataset (e.g., "open-domain" for TriviaQA, "biomed" for BioASQ). The paper implicitly indicates that all work was conducted in English. Broader implications regarding data representation are discussed in the "Ethical Considerations" section concerning bias amplification.
B6 Statistics For Data: Yes
B6 Elaboration: Yes, we report relevant statistics. In Section 4.1, we specify that each benchmark corpus was sampled to a size of approximately 5 million unique tokens. In Appendix A.1, we specify the source datasets used for creating our fine-tuning data and their nature (e.g., Dolly-v2).
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Yes, this is reported in Section 4.4 "Implementation Details". We specify the model size (LLaMA 3.1-8B-Instruct) and the computing infrastructure (a single NVIDIA RTX 3090 GPU with 24GB VRAM). While total GPU hours were not explicitly reported, the provided information allows for an estimation of the computational budget.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Yes. Section 4.4 provides the specific hyperparameter configurations for both Combined Retrieve RAG and Speculative RAG for each dataset. Appendix A.2 lists the key hyperparameters for the QLoRA fine-tuning process. Appendix B presents a study on the GraphRAG prompt tuning configuration, justifying our choice of parameters.
C3 Descriptive Statistics: Yes
C3 Elaboration: Yes, our results are reported as summary statistics over sets of experiments. Accuracy metrics (EM, LLM-Judge) are fractions or percentages over the entire evaluation set for each benchmark. As stated in Section 4.2, latency is reported as the median wall-clock time per query, which is a robust descriptive statistic.
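As a trivial illustration of the latency statistic named above (all values invented): the median wall-clock time per query is insensitive to a few slow outliers, unlike the mean.

```python
# Hypothetical per-query latencies in milliseconds; one slow outlier.
from statistics import mean, median

latencies_ms = [212, 198, 205, 1950, 201]
print(median(latencies_ms))  # 205 -> unaffected by the 1950 ms outlier
print(mean(latencies_ms))    # 553.2 -> dominated by the single outlier
```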
C4 Parameters For Packages: Yes
C4 Elaboration: Yes. In Section 4.4, we specify the configuration for the GraphRAG framework, including chunk size and overlap. In Appendix A.2, we detail the specific QLoRA hyperparameters used for fine-tuning via the PEFT library.
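For readers unfamiliar with the fine-tuning setup named in C2/C4, here is a hedged sketch of a QLoRA configuration via the PEFT library. Every hyperparameter value shown is a generic placeholder, not the paper's Appendix A.2 settings, and the base-model identifier is assumed from the model name given in Section 4.4.

```python
# Hedged illustration of a QLoRA setup with PEFT; all values below are
# placeholders, not the settings reported in the paper's Appendix A.2.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                       # QLoRA: 4-bit base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",      # assumed HF id for the base model
    quantization_config=bnb,
)
lora = LoraConfig(                           # placeholder hyperparameters
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()           # only adapter weights are trained
```

Quantizing the frozen base model to 4 bits while training small LoRA adapters is what makes fine-tuning an 8B model feasible on the single 24GB GPU reported in C1.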
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: AI assistants were used to aid in code debugging and improving the grammar and clarity of the manuscript. The core research ideas, experimental design, and final analysis were conducted by the authors.
Author Submission Checklist: yes
Submission Number: 100