Keywords: large language models, context compression
TL;DR: We develop a simple mean-pooling approach for soft context compression that consistently outperforms the widely used compression-tokens architecture, and study multi-ratio compression training across model families, scales, and datasets.
Abstract: A common strategy to reduce the computational costs of using long contexts in retrieval-augmented generation (RAG) with large language models (LLMs) is soft context compression, where the input sequence is transformed into a shorter continuous representation.
We develop a lightweight and simple mean-pooling approach that consistently outperforms the widely used compression-tokens architecture, and study training the same compressor to output multiple compression ratios.
We conduct extensive experiments across in-domain and out-of-domain QA datasets, as well as across model families, scales, and compression ratios.
Overall, our simple mean-pooling approach achieves the strongest performance, with a relatively small drop when training for multiple compression ratios.
More broadly, however, the trade-offs across architectures and training regimes are more nuanced, illustrating the complex landscape of compression methods.
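To make the mean-pooling idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation): it assumes the compressor averages contiguous spans of token hidden states, so a sequence of length L is reduced to roughly L / ratio soft embeddings that are fed to the LLM in place of the full context. The function name `mean_pool_compress` and the zero-padding of the final span are illustrative simplifications.

```python
import torch
import torch.nn.functional as F


def mean_pool_compress(hidden_states: torch.Tensor, ratio: int) -> torch.Tensor:
    """Illustrative mean-pooling soft compression.

    Averages contiguous spans of `ratio` token hidden states into one soft
    embedding each, shrinking (batch, seq_len, dim) to
    (batch, ceil(seq_len / ratio), dim).
    """
    batch, seq_len, dim = hidden_states.shape
    pad = (-seq_len) % ratio
    if pad:
        # Right-pad with zeros so seq_len divides evenly by ratio
        # (a simplification; a real implementation might mask instead).
        hidden_states = F.pad(hidden_states, (0, 0, 0, pad))
    # Group tokens into spans of length `ratio` and average within each span.
    pooled = hidden_states.view(batch, -1, ratio, dim).mean(dim=2)
    return pooled


# Usage: compress a 512-token context 8x into 64 soft embeddings.
ctx = torch.randn(1, 512, 4096)
soft_ctx = mean_pool_compress(ctx, ratio=8)
print(soft_ctx.shape)  # torch.Size([1, 64, 4096])
```

Training the same pooling compressor with several values of `ratio` would correspond to the multi-ratio setting studied in the paper.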
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19946