Self-RAG: Self-reflective Retrieval Augmented Generation
Keywords: Language Models, Retrieval-augmented Language Models, Retrieval Augmentation, Factuality
TL;DR: Our new framework Self-RAG enhances the quality and factuality of instruction-tuned LLMs with on-demand retrieval and self-reflection.
Abstract: Scaling up language models (LMs) or instruction tuning has shown limited effect on improving the factuality of LM outputs. Retrieval-Augmented Generation (RAG), an ad hoc approach that augments LMs with retrieval, reduces hallucinations in large LMs. However, indiscriminately retrieving and incorporating a fixed number of passages, regardless of whether retrieval is necessary or the passages are relevant, diminishes the versatility of instruction-following LMs and can lead to unhelpful responses. In this work, we introduce a new framework called **Self-Reflective Retrieval-Augmented Generation (Self-RAG)** that enhances an LM's quality and factuality through retrieval and self-reflection. Our framework trains a single arbitrary LM to adaptively retrieve passages on demand, and to generate and reflect on retrieved passages and its own generations using special tokens, called *reflection* tokens, on diverse instruction-tuning data interleaved with retrieved passages and reflection tokens. Generating reflection tokens makes the LM controllable at inference time, enabling it to tailor its behavior to diverse task requirements. Experiments show that Self-RAG (7B and 13B parameters) significantly outperforms state-of-the-art pre-trained and instruction-following LLMs and retrieval-augmented models on a diverse set of tasks. Specifically, Self-RAG outperforms ChatGPT and retrieval-augmented Llama2-chat on open-domain QA, fact verification, and reasoning tasks, and it shows significant gains in factuality scores and citation accuracy for long-form generations relative to these models.
Submission Number: 66