Keywords: distillation, long context, efficiency, LLM, compression
TL;DR: We apply a distillation objective to LLM KV caches, achieving high compression ratios with minimal performance loss.
Abstract: Sequence-to-sequence natural language tasks often benefit greatly from long contexts, but the quadratic complexity of self-attention renders the use of long contexts non-trivial. In particular, during generation, temporary representations (stored in the KV cache) account for a large portion of GPU memory usage and scale linearly with context length. In this work, we introduce KV-Distill, a flexible compression framework for large language models (LLMs) that distills long-context KV caches into significantly shorter representations. KV-Distill can be trained as a parameter-efficient adaptor for pre-trained models, and enables the compression of arbitrary spans of a context while preserving the pre-trained model's capabilities, including instruction-tuning. We do this by treating the compressed and uncompressed caches as a student-teacher pairing and applying a KL-type divergence to match the generated outputs. Our experiments show that KV-Distill outperforms other compression techniques in worst-case extractive tasks, and approaches uncompressed performance in long-context question answering and summarization. Furthermore, KV-Distill can be fine-tuned on domain-specific contexts to reduce context lengths by up to 95% while preserving downstream task performance. We demonstrate the generalizability of KV-Distill across various model sizes and architectures.
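The student-teacher objective described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's implementation): it assumes a HuggingFace-style causal LM that returns `.logits`, and it approximates cache compression by retaining only a subset of context positions indexed by `keep_idx`, rather than the learned parameter-efficient compression adaptor. The teacher conditions on the full context, the student on the compressed prefix, and a KL divergence matches their next-token distributions over the continuation tokens.

```python
import torch
import torch.nn.functional as F

def kv_distill_loss(model, input_ids, context_len, keep_idx, temperature=1.0):
    """Hypothetical sketch of a KL-type KV-cache distillation objective.

    input_ids   : (batch, seq_len) context tokens followed by continuation tokens
    context_len : number of leading context tokens
    keep_idx    : LongTensor of context positions retained by the "compressed" cache
    """
    cont_len = input_ids.size(1) - context_len

    # Teacher: full, uncompressed context (no gradients).
    with torch.no_grad():
        teacher_logits = model(input_ids).logits

    # Student: keep only a subset of context positions, then append the
    # continuation unchanged. (A stand-in for a learned cache compressor.)
    compressed_ids = torch.cat(
        [input_ids[:, :context_len][:, keep_idx], input_ids[:, context_len:]], dim=1
    )
    student_logits = model(compressed_ids).logits

    # Logits at position i predict token i+1, so align both passes on the
    # distributions that predict the continuation tokens.
    t = teacher_logits[:, context_len - 1 : -1] / temperature
    s = student_logits[:, -cont_len - 1 : -1] / temperature

    return F.kl_div(
        F.log_softmax(s, dim=-1), F.softmax(t, dim=-1), reduction="batchmean"
    ) * temperature**2
```

In practice the student pass would run with a genuinely compressed KV cache (and trainable adaptor parameters) rather than a re-tokenized subset, but the loss structure, matching compressed-cache outputs to uncompressed-cache outputs, is the same.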
Primary Area: transfer learning, meta learning, and lifelong learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12681