Abstract: Sequence-to-sequence tasks often benefit from long contexts, but the quadratic complexity of self-attention in standard Transformers renders this non-trivial. During generation, temporary representations -- stored in the so-called KV cache -- account for a large portion of GPU memory usage and scale linearly with context length. We introduce KV-DISTILL, a Transformer compression framework that distills long-context KV caches into significantly shorter representations in a $\textit{question-independent}$ fashion. KV-DISTILL can be trained as a parameter-efficient adaptor for pre-trained models, and enables the compression of arbitrary spans of a context while preserving pre-trained model capabilities. We treat a compressed-uncompressed cache as a student-teacher pairing and apply a KL-type divergence to match the generated outputs. KV-DISTILL outperforms other compression techniques on worst-case extractive tasks and approaches uncompressed performance in long-context question answering and summarization. It can be fine-tuned on domain-specific contexts to reduce context lengths by up to 99% while preserving downstream performance. We demonstrate the generalizability of KV-DISTILL across various model sizes and architectures.
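As a rough illustration of the student-teacher objective described in the abstract, the sketch below computes a KL divergence between next-token distributions obtained with an uncompressed ("teacher") KV cache and a compressed ("student") KV cache. This is a minimal, hypothetical PyTorch example; the function name, temperature parameter, and toy shapes are assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): match output distributions produced
# with the full KV cache (teacher) and a compressed KV cache (student).
import torch
import torch.nn.functional as F

def kv_distill_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    temperature: float = 1.0) -> torch.Tensor:
    """KL-type divergence between teacher and student next-token distributions,
    averaged over the batch. Shapes: (batch, seq_len, vocab)."""
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    # kl_div expects log-probabilities for the input; log_target=True lets the
    # target also be given in log space.
    return F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")

# Toy usage: logits from decoding with the compressed vs. uncompressed cache.
student = torch.randn(2, 5, 32000)  # hypothetical batch of 2, 5 positions
teacher = torch.randn(2, 5, 32000)
loss = kv_distill_loss(student, teacher)
```

In practice the two sets of logits would come from the same frozen pre-trained model conditioned on the compressed and uncompressed caches respectively, with only the compression adaptor receiving gradients.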
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Generation, Summarization, Language Modeling
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 5827