Keywords: context compression, language models
TL;DR: Encoder-decoder compression at scale.
Abstract: Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time or compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pretraining many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pretrain 0.6B-encoder 4B-decoder models on over \tokcount tokens at compression ratios of 1:4, 1:8, and 1:16. We introduce \textit{Latent Context Language Models} (LCLMs), a family of compressors that improve the Pareto frontier between general-task performance, compression speed, and peak memory usage.
We demonstrate that LCLMs, used as efficient backbones for agentic RAG systems, offer a practical path to long-context agents on complex real-world tasks.
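
To make the compression ratios concrete, the following minimal sketch computes how many latent embeddings a decoder would consume at 1:4, 1:8, and 1:16, and how the decoder-side KV cache would shrink as a result. Everything beyond the three ratios is an assumption made for illustration: the function names, the round-up handling of leftover tokens, the 128,000-token context, and the decoder shape (36 layers, 8 KV heads, head dimension 128, 2 bytes per value) are hypothetical placeholders and are not taken from the submission.

```python
import math


def latent_length(num_tokens: int, ratio: int) -> int:
    """Latent embeddings produced at a 1:ratio compression ratio.

    Assumption: a trailing partial group of tokens still yields one latent,
    so the result is rounded up.
    """
    if num_tokens < 0 or ratio < 1:
        raise ValueError("num_tokens must be >= 0 and ratio must be >= 1")
    return math.ceil(num_tokens / ratio)


def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Standard KV cache size estimate for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    context_tokens = 128_000  # hypothetical prompt length
    # Hypothetical decoder shape, chosen only to make the arithmetic concrete.
    shape = dict(num_layers=36, num_kv_heads=8, head_dim=128)

    full = kv_cache_bytes(context_tokens, **shape)
    print(f"uncompressed: {context_tokens} positions, {full / 2**30:.2f} GiB")
    for ratio in (4, 8, 16):
        latents = latent_length(context_tokens, ratio)
        compressed = kv_cache_bytes(latents, **shape)
        print(f"1:{ratio}: {latents} latents, {compressed / 2**30:.2f} GiB")
```

Under these assumptions the decoder's cache scales linearly with the number of latents, so a 1:16 ratio would cut it to one sixteenth of the uncompressed size. The sketch ignores the encoder's own memory and any uncompressed tokens the decoder may also attend to.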
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 95