Keywords: Question Answering, Synthetic Data, Retrieval Augmented Generation, Natural Language Processing
TL;DR: We study the degree to which synthetic data can effectively substitute for real data in RAG generator fine-tuning.
Abstract: To improve large language models (LLMs) on question answering (QA) tasks, system architects often turn to retrieval-augmented generation (RAG) or fine-tuning to increase a model's performance. In many applications, however, there is a dearth of real data of sufficient quality to support fine-tuning the generator of a RAG system for QA tasks. In this work, we study the degree to which synthetic data can effectively substitute for real data in RAG generator fine-tuning. Using GPT-4o, we generate a synthetic version of the HotpotQA training set and fine-tune a Llama-3 generator separately on the real and synthetic data. We evaluate our models with a range of metrics, including token-level F1, BERTScore, and LLM-as-a-judge. Across these metrics, model performance generally increases after fine-tuning, primarily due to better conformity to the style of the answer distribution and secondarily due to improved use of retrieved contexts. We observe that relative performance depends on the quality of the retriever, underscoring the importance of the training data distribution for improving the model's reasoning over multiple contexts. We further show that the fine-tuned model trained on synthetic data generalizes better to similar held-out QA tasks, outperforming an LLM fine-tuned on real data by 36% in LLM-judged correctness on the RepLiQA dataset. These findings motivate a system-level analysis of the marginal benefits of generator fine-tuning in RAG pipelines and provide practical insights on the utility of synthetic training data for both RAG systems engineers and future researchers.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6354
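For reference, the token-level F1 metric cited in the abstract is conventionally computed in the SQuAD/HotpotQA style: normalize the predicted and gold answers, then score bag-of-tokens overlap. The sketch below illustrates that convention only; it is an assumption about the evaluation setup, and the function names are illustrative rather than taken from the paper's code.

```python
import re
import string
from collections import Counter


def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace (SQuAD-style normalization)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and a gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# A verbose prediction that contains the gold span still loses F1 to extra tokens,
# which is one way stylistic mismatch with the answer distribution shows up in this metric.
print(token_f1("The answer is Paris, France.", "Paris"))  # 0.4
```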