Method of Building RAG-Powered Instruction Dataset from Raw Corporate Text Data for LLM Fine-Tuning

Published: 09 Mar 2025, Last Modified: 09 Mar 2025 · MathAI 2025 Oral · CC BY 4.0
Keywords: Retrieval Augmented Generation, RAG, LLM, Fine-Tuning, Instruction Dataset, AI-assistant
TL;DR: Description of the method for building an instruction dataset for fine-tuning an LLM that will serve as a generator in the RAG pipeline of a corporate intelligent assistant.
Abstract: The emergence of powerful LLMs enables the creation of high-quality intelligent assistants for QA search over corporate data. There are two commonly used approaches for adapting LLMs to domain-specific data: fine-tuning and Retrieval Augmented Generation (RAG). A RAG system builds a vector database from corporate data split into chunks of a certain size; a retriever then finds the chunks most relevant to the user query and passes them to a generator, which extracts the answer. Neither method on its own guarantees the highest quality of the final response: a fine-tuned LLM requires frequent retraining to keep its knowledge up to date, which is a costly and complex process, while naive RAG suffers from the limitations inherent in retrieval. In this paper we present a method for building an instruction dataset for fine-tuning the generator LLM of a RAG system in a corporate intelligent assistant. As an example, the instruction dataset is generated from raw corporate data of Donetsk State University (DonSU) in Russian. We suggest combining the naive RAG approach with fine-tuning an LLM on a specific instruction dataset: a third component, the relevant context retrieved from the vector database, is added to the standard input-output instruction pairs. The proposed method involves five steps: 1) collecting raw text data from the university website; 2) selecting relevant texts; 3) generating synthetic input-output instruction pairs; 4) building a vector database from the relevant texts; 5) supplementing the generated pairs with context retrieved from the vector database based on similarity to the input. Fine-tuning the LLM on such a dataset not only makes it memorize domain-specific data, but also improves its ability to interpret the information provided by the retriever from the vector database.
This is a key aspect of the generator in the RAG system, and this training approach ensures more accurate performance of the assistant. As a result, we created a RAG-powered instruction dataset from raw corporate text data about DonSU. It contains over 25,000 input-context-output records, specifically prepared for training an LLM that will serve as the generator in the developed intelligent assistant.
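The fifth step of the method, supplementing each synthetic input-output pair with retrieved context, can be sketched as follows. This is a minimal illustration, not the authors' code: a toy token-overlap (Jaccard) similarity stands in for the actual vector-database retriever, and all function names and sample strings are hypothetical.

```python
# Hypothetical sketch of step 5: turning (input, output) instruction pairs
# into (input, context, output) records by attaching the most similar chunks.
# A toy Jaccard similarity replaces the real embedding-based retriever.

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two strings (toy retriever score)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def attach_context(pairs, chunks, top_k=1):
    """Build input-context-output records from instruction pairs and text chunks."""
    records = []
    for inp, out in pairs:
        # Rank all chunks by similarity to the instruction input.
        ranked = sorted(chunks, key=lambda c: jaccard(inp, c), reverse=True)
        records.append({"input": inp, "context": ranked[:top_k], "output": out})
    return records

# Hypothetical sample data for illustration only.
chunks = ["DonSU was founded in 1937.", "Admissions open in June."]
pairs = [("When was DonSU founded?", "It was founded in 1937.")]
records = attach_context(pairs, chunks)
print(records[0]["context"])  # → ['DonSU was founded in 1937.']
```

In the described pipeline, the similarity function would instead be cosine similarity over embeddings stored in the vector database, and the resulting records form the training set for the generator LLM.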
Submission Number: 48