Abstract: In the traditional RAG framework, the basic retrieval units are normally short. Common retrievers such as DPR typically work with 100-word Wikipedia paragraphs. Such a design forces the retriever to search over a large corpus to find the “needle” unit, whereas the reader only needs to generate answers from the short retrieved units. This imbalanced design, with a “heavy” retriever and a “light” reader, can lead to sub-optimal performance: the loss of contextual information in the short, chunked units may increase the likelihood of introducing hard negatives during the retrieval stage, and the reader cannot fully leverage the capabilities of recent long-context LLMs. To alleviate this imbalance, we propose a new framework, LongRAG, consisting of a “long retriever” and a “long reader”. On two Wikipedia-based datasets, NQ and HotpotQA, where the average document is shorter than 1K tokens, LongRAG processes the entire Wikipedia corpus into 4K-token units by grouping related documents, making these units 30 times longer than before. Increasing the unit size reduces the total number of units from 22M to 600K, which greatly eases the burden on the retriever and yields strong retrieval performance with only a few (fewer than 8) top units. Compared to traditional RAG, which may require hundreds of short units to achieve similar retrieval performance, our approach reduces the likelihood of retrieving hard negatives while maintaining the semantic integrity of each unit. We then feed the retrieved units (≈30K tokens) to an existing long-context LLM to perform zero-shot answer generation. Without requiring any training, LongRAG achieves an EM of 62.7% on NQ and 64.3% on HotpotQA, on par with the (fully trained) SoTA models. Furthermore, we test on two non-Wikipedia-based datasets, Qasper and MultiFieldQA-en, where the average document length already exceeds 4K tokens. Here LongRAG processes each individual document as a single (long) unit rather than chunking it into smaller pieces, achieving an F1 score of 25.9% on Qasper (previously 22.5%) and 57.5% on MultiFieldQA-en (previously 51.2%). Our study offers insights into the future roadmap for combining RAG with long-context LLMs.
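To make the two stages concrete, below is a minimal Python sketch of the pipeline the abstract describes: grouping documents into ~4K-token retrieval units, then feeding the few top-ranked units to a long-context LLM for zero-shot answering. This is an illustrative sketch, not the paper's implementation; the helpers `embed` and `llm_generate`, the whitespace token count, and the adjacency-based grouping (the paper groups related Wikipedia documents, e.g., via links) are all assumptions.

```python
from typing import Callable, List

# Constants taken from the abstract: ~4K-token units, fewer than 8 top units.
MAX_UNIT_TOKENS = 4096
TOP_K = 8

def build_long_units(corpus: List[dict]) -> List[str]:
    """Group documents into long (~4K-token) retrieval units.

    Assumes `corpus` is ordered so that related documents are adjacent;
    the paper's actual grouping of related documents is more involved.
    """
    units, current, current_len = [], [], 0
    for doc in corpus:
        n = len(doc["text"].split())  # crude token proxy for this sketch
        if current and current_len + n > MAX_UNIT_TOKENS:
            units.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(doc["text"])
        current_len += n
    if current:
        units.append("\n\n".join(current))
    return units

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def longrag_answer(question: str,
                   units: List[str],
                   embed: Callable[[str], List[float]],
                   llm_generate: Callable[[str], str]) -> str:
    """Long retriever + long reader: rank units by dense similarity,
    then pass the top-k units (roughly 30K tokens) to a long-context LLM."""
    q = embed(question)
    ranked = sorted(units, key=lambda u: -dot(q, embed(u)))
    context = "\n\n".join(ranked[:TOP_K])
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm_generate(prompt)  # zero-shot: no fine-tuning of either stage
```

In practice the unit embeddings would be precomputed and searched with an ANN index rather than re-embedded per query; the sketch only shows the control flow of the “long retriever, long reader” design.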
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Zhe_Gan1
Submission Number: 3275