# SynthKGQA

Code to reproduce the results of the paper "Ground-Truth Subgraphs for Better Training and Evaluation of Knowledge Graph Augmented LLMs".

## SynthKGQA: KGQA dataset generation

### Steps 1-2: candidate generation
```bash
python3 synth_kgqa/generate.py --kg-path <path to KG directory> --num-samples <number of questions to generate> --num-edges <number of edges in answer subgraphs> --save-path <path to output>
```
This generates a set of question-answer graph pairs from the provided knowledge graph. See [kgqa_dataset/parse.py](./kgqa_dataset/parse.py) for additional parameters.

### Steps 3-4: candidate validation, augmentation and classification
```bash
python3 synth_kgqa/process_qa.py --kg-path <path to wikiKG2 directory> --qa_path <path to output of generate.py>
```

The final data will be stored in `<save-path>/processed_qa.json`.

## GTSQA

The folder [GTSQA](GTSQA/) contains the train and test sets for the GTSQA dataset, constructed with SynthKGQA from the [ogbl-wikikg2](https://ogb.stanford.edu/docs/linkprop/#ogbl-wikikg2) KG. 

## KG-RAG benchmarks

The script `compute_neighs_and_sp.py` samples the question-specific subgraphs for training and test questions (Appendix C.1) and computes the shortest path between seed and answer nodes, used for the analysis in Section 6.
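To illustrate the two operations mentioned above, here is a minimal, self-contained sketch of k-hop neighbourhood extraction and breadth-first shortest-path search on a toy graph. It is not the code in `compute_neighs_and_sp.py`: the function names (`build_adjacency`, `k_hop_neighbourhood`, `shortest_path`), the triple format, and the choice to treat edges as undirected are all assumptions made for illustration.

```python
from collections import defaultdict, deque


def build_adjacency(triples):
    """Build an undirected adjacency map from (head, relation, tail) triples.

    Treating edges as undirected is an assumption of this sketch.
    """
    adj = defaultdict(set)
    for head, _relation, tail in triples:
        adj[head].add(tail)
        adj[tail].add(head)
    return adj


def k_hop_neighbourhood(adj, seeds, k):
    """Return the set of nodes within k hops of any seed node."""
    visited = set(seeds)
    frontier = set(seeds)
    for _ in range(k):
        # Expand one hop and keep only nodes not seen before.
        frontier = {n for node in frontier for n in adj.get(node, ())} - visited
        if not frontier:
            break
        visited |= frontier
    return visited


def shortest_path(adj, source, target):
    """Return one shortest path (list of nodes) via BFS, or None if unreachable."""
    if source == target:
        return [source]
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj.get(node, ()):
            if neighbour in parent:
                continue
            parent[neighbour] = node
            if neighbour == target:
                # Walk back through parents to reconstruct the path.
                path = [neighbour]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Toy graph with hypothetical node and relation names.
toy_triples = [
    ("a", "r1", "b"),
    ("b", "r2", "c"),
    ("c", "r3", "d"),
    ("a", "r4", "e"),
]
toy_adj = build_adjacency(toy_triples)
one_hop = k_hop_neighbourhood(toy_adj, ["a"], 1)
path_a_to_d = shortest_path(toy_adj, "a", "d")
```

On a graph the size of ogbl-wikikg2, a plain Python BFS like this would likely be slow; a sparse-matrix or dedicated graph-library implementation is the more practical choice there.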

The code to benchmark the KG-RAG models evaluated in the paper is available in [benchmarks/](benchmarks/).