Keywords: Large Language Models, Long-context, Context-based Q&A, Efficiency, Compression
TL;DR: Compressing context embeddings before feeding them to the LLM leads to lower latency, a smaller memory footprint, and better results on long-context Q&A
Abstract: Large Language Models (LLMs) face significant computational challenges when processing long contexts due to the quadratic complexity of self-attention. While soft context compression methods, which map input text to smaller latent representations, have shown promise, their real-world adoption is limited. Existing techniques typically compress the context as a single unit, which leads to quadratic compression complexity and an inability to reuse computations across queries with overlapping contexts.
In this work, we introduce CompLLM, a soft compression technique designed for practical deployment. Instead of processing the context holistically, CompLLM divides it into segments and compresses each one independently. This simple design choice yields three critical properties: efficiency, as the compression step scales linearly with the context length; scalability, enabling models trained on short sequences (e.g., 1k tokens) to generalize to contexts of 100k tokens; and reusability, allowing compressed segments to be cached and reused across different queries.
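A minimal sketch of the segment-wise design described above; all names (`compress_segment`, `SEGMENT_TOKENS`, the cache structure) are illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch of segment-wise soft compression with per-segment caching.
# compress_segment, SEGMENT_TOKENS, and COMPRESSION_RATE are assumed names for
# illustration; the actual CompLLM implementation may differ.
from typing import Dict, List, Tuple

SEGMENT_TOKENS = 1024          # segments short enough to train on (e.g., ~1k tokens)
COMPRESSION_RATE = 2           # 2x fewer latent embeddings than input tokens

SegmentKey = Tuple[int, ...]   # hashable view of a segment's token ids
cache: Dict[SegmentKey, List[List[float]]] = {}

def compress_segment(token_ids: List[int]) -> List[List[float]]:
    """Placeholder for a learned compressor that maps a segment's token
    embeddings to len(token_ids) // COMPRESSION_RATE latent embeddings."""
    raise NotImplementedError

def compress_context(token_ids: List[int]) -> List[List[float]]:
    """Compress a long context by splitting it into fixed-size segments and
    compressing each independently: cost grows linearly with context length,
    and segments shared across queries are reused from the cache."""
    latents: List[List[float]] = []
    for start in range(0, len(token_ids), SEGMENT_TOKENS):
        segment = token_ids[start:start + SEGMENT_TOKENS]
        key: SegmentKey = tuple(segment)
        if key not in cache:
            cache[key] = compress_segment(segment)
        latents.extend(cache[key])
    return latents  # fed to the LLM in place of the full token sequence
```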
Our extensive experiments show that with a 2x compression rate, CompLLM speeds up Time To First Token (TTFT) by up to 4x, reduces the KV cache size by 50%, and effectively doubles an LLM's maximum context length. Furthermore, CompLLM achieves performance competitive with using the uncompressed context, and even surpasses it on very long sequences, demonstrating its effectiveness and practical utility.
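A back-of-envelope view of these numbers, assuming prefill cost is dominated by the quadratic self-attention term and the KV cache grows linearly with sequence length:

```latex
% With n context tokens compressed to n/2 latent embeddings:
\[
\text{prefill cost} \propto n^2
\;\xrightarrow{\,n \to n/2\,}\;
\left(\tfrac{n}{2}\right)^2 = \tfrac{n^2}{4}
\quad\Rightarrow\quad \text{up to } 4\times \text{ TTFT speedup},
\]
\[
\text{KV cache} \propto n
\;\xrightarrow{\,n \to n/2\,}\;
\tfrac{n}{2}
\quad\Rightarrow\quad 50\%\ \text{reduction}.
\]
```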
Primary Area: generative models
Submission Number: 9612