SCA: Selective Compression Attention for Efficiently Extending the Context Window of Large Language Models

SCA: Selective Compression Attention for Efficiently Extending the Context Window of Large Language Models

ACL ARR 2024 June Submission3683 Authors

16 Jun 2024 (modified: 02 Jul 2024)ACL ARR 2024 June SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Abstract: Large language models (LLMs) have achieved impressive performance across various domains, but the limited context window and the expensive computational cost of processing long texts restrict their more comprehensive application. In this paper, we propose Selective Compression Attention (SCA), a general and effective method to expand the context window and reduce memory footprint by compressing the KV cache of LLMs. Specifically, through preliminary experiments, we found that the KV cache contains many similar vectors, resulting in information redundancy, which can be compressed by retaining representative vectors and discarding others. Therefore, SCA continuously selects the most distinctive vectors to keep through a greedy algorithm, reducing information loss during compression. Extensive experiments on various tasks verify the effectiveness of our method. Compared with existing methods, SCA can significantly reduce the impact on model performance under the same compression ratio. Furthermore, the context window of LLMs can be efficiently expanded using SCA without any training, which can even achieve better performance than specially fine-tuned long context models.

Paper Type: Long

Research Area: Generation

Research Area Keywords: efficient models,inference methods,NLP in resource-constrained settings

Contribution Types: Model analysis & interpretability, Approaches low compute settings-efficiency

Languages Studied: English

Submission Number: 3683

Loading