OSCAR: Online Soft Compression for RAG

ICLR 2026 Conference Submission13720 Authors

18 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: RAG, Compression, Embedding, Efficiency, Question Answering
TL;DR: OSCAR is the first online query-dependent soft compression method, enabling a 2-5x speed-up of RAG pipelines with little to no accuracy loss.
Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge, leading to improved accuracy and relevance. However, scaling RAG pipelines remains computationally expensive as context length grows. On the one hand, hard compression methods have recently been proposed to prune the retrieved text on the fly, but with limited compression ratios. On the other hand, soft compression methods perform a costly offline compression with a dedicated LLM, but achieve higher compression rates. In this paper, we introduce OSCAR, a novel query-dependent online soft compression method for RAG. OSCAR bridges the gap between online hard and offline soft compression methods, bringing the best of both: OSCAR dynamically compresses retrieved information at inference time, eliminating storage overhead and enabling higher compression rates than existing methods. Our experiments demonstrate state-of-the-art performance with a 2-5x speed-up in inference and minimal, if any, accuracy loss, for LLMs ranging from 1B to 24B parameters.
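The core idea of query-dependent soft compression can be illustrated with a toy sketch: document token embeddings are pooled into a small set of "soft" vectors conditioned on the query, which would then replace the full document tokens in the LLM's context. Everything below (the `soft_compress` function, random seed vectors standing in for trained compressor parameters, and the dimensions) is a hypothetical illustration, not the paper's actual architecture.

```python
import numpy as np

def soft_compress(query_emb, doc_embs, k, rng):
    """Pool n document token embeddings into k soft vectors,
    conditioned on the query (query-dependent compression sketch)."""
    d = doc_embs.shape[1]
    # k memory seeds biased by the query; in a real compressor these
    # would be learned parameters of a dedicated model (assumption).
    seeds = rng.standard_normal((k, d)) + query_emb        # (k, d)
    scores = seeds @ doc_embs.T / np.sqrt(d)               # (k, n)
    # Softmax over document tokens -> attention pooling weights.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ doc_embs                              # (k, d)

rng = np.random.default_rng(0)
d, n_tokens, k = 16, 128, 8
query_emb = rng.standard_normal(d)
doc_embs = rng.standard_normal((n_tokens, d))   # retrieved passage
compressed = soft_compress(query_emb, doc_embs, k, rng)
print(compressed.shape)  # 128 tokens -> 8 soft vectors (16x rate)
```

Because the pooling happens at inference time, nothing document-specific needs to be precomputed or stored, which is the storage advantage the abstract claims over offline soft compression.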
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 13720