ZigzagAttention: Efficient Long-Context Inference with Exclusive Retrieval and Streaming Heads

ZigzagAttention: Efficient Long-Context Inference with Exclusive Retrieval and Streaming Heads

ACL ARR 2025 February Submission8149 Authors

16 Feb 2025 (modified: 09 May 2025)ACL ARR 2025 February SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Abstract: With the rapid development of large language models (LLMs), handling long context has become one of the vital abilities in LLMs. Such long-context ability is accompanied by difficulties in deployment, especially due to the increased consumption of KV cache. There is certain work aiming to optimize the memory footprint of KV cache, inspired by the observation that attention heads can be categorized into retrieval heads that are of great significance and streaming heads that are of less significance. Typically, identifying the streaming heads and and waiving the KV cache in the streaming heads would largely reduce the overhead without hurting the performance that much. However, since employing both retrieval and streaming heads in one layer decomposes one large round of attention computation into two small ones, it may unexpectedly bring extra latency on accessing and indexing tensors. Based on this intuition, we impose an important improvement to the identification process of retrieval and streaming heads, in which we design a criterion that enforces exclusively retrieval or streaming heads gathered in one unique layer. In this way, we further eliminate the extra latency and only incur negligible performance degradation. Our method named ZigzagAttention is competitive among considered baselines owing to reduced latency and comparable performance.

Paper Type: Short

Research Area: Efficient/Low-Resource Methods for NLP

Research Area Keywords: Efficient/Low-Resource Methods for NLP, Language Modeling

Contribution Types: Approaches to low-resource settings, Approaches low compute settings-efficiency

Languages Studied: English

Submission Number: 8149

Loading