CASAK-V: Dynamic Sparse Attention and Adaptive KV-Cache Compression for Memory-Efficient Long-Context LLM Inference
Keywords: Large Language Models, Sparse Attention, KV-cache Compression, Long-context Processing, Meta-learning, Adaptive Algorithms, Memory Efficiency, Inference Optimization, On-device Deployment, Context-aware Models, Dynamic Attention, Transformer Architectures, Efficient Natural Language Processing, Machine Learning Systems, Attention Mechanisms, Sparse Computation, Benchmarking, Model Compression, Resource-constrained Computing, Edge AI, Computational Complexity, Information Retrieval, Self-attention, Transfer Learning, Deep Learning, Artificial Intelligence, Chunk-wise Compression, Pattern Recognition
TL;DR: CASAK-V: A learned approach for dynamic sparse attention and KV-cache compression, enabling efficient long-context LLM inference in memory-constrained environments with minimal performance loss.
Abstract: The emergence of long-context Large Language Models (LLMs) has triggered a rapid expansion of applications across various domains. However, these models remain inaccessible for on-device or on-premises deployments due to significant computational and memory challenges. The quadratic complexity of attention mechanisms and the substantial memory requirements of KV-caches hinder adoption in resource-constrained environments. Current solutions, such as sparse attention mechanisms and KV-cache compression techniques, often rely on pre-observed patterns or context-independent, head-specific profiling strategies, which can compromise model accuracy, especially in long-context processing. This paper introduces Context-Aware adaptive Sparse Attention with Key-Value cache compression (CASAK-V), an inference-time approach that dynamically generates and applies head-specific sparse attention patterns. CASAK-V leverages a meta-learning framework to fine-tune a compact pre-trained vision-language encoder-decoder transformer to identify sparse patterns from per-layer attention scores. These patterns include fixed local windows, dynamic column stripes, block-sparse layouts, and various other learned hybrid configurations. The technique additionally implements adaptive chunk-wise KV-cache compression using policies derived from these layer-wise sparse configurations. To retain context-awareness, these configurations are dynamically adjusted during token generation based on an attention map reconstruction heuristic. Our evaluations show that CASAK-V achieves minimal performance degradation on long-context benchmarks (LongBench), while reducing memory usage by 40% and delivering near-linear runtime scaling compared with full attention and full caching. In summary, CASAK-V enables efficient long-context processing in memory-limited environments, extending the applicability of LLMs and facilitating their deployment in on-premises and on-device scenarios.
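To make the two core mechanisms in the abstract concrete, the following is a minimal, illustrative sketch and not the authors' CASAK-V implementation: it assumes hypothetical helpers `sparse_attention` and `compress_kv_chunks`, illustrative hyperparameters (`window`, `n_stripes`, `chunk`, `keep_ratio`), and a sequence length divisible by `chunk`. It only shows the general shape of a head-specific sparse mask (local window plus dynamic column stripes) and a chunk-wise KV-cache keep/drop policy; the paper's meta-learned pattern predictor and attention-map reconstruction heuristic are not modeled here.

```python
# Illustrative sketch only -- NOT the CASAK-V implementation described in the paper.
import torch
import torch.nn.functional as F


def sparse_attention(q, k, v, window=64, n_stripes=8):
    """q, k, v: [batch, heads, seq, dim]. Applies a causal mask combining a
    fixed local window with per-head dynamic column stripes."""
    b, h, s, d = q.shape
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / d ** 0.5

    # Fixed local window: each query attends to the most recent `window` keys.
    idx = torch.arange(s, device=q.device)
    causal = idx[None, :] <= idx[:, None]
    local = causal & (idx[:, None] - idx[None, :] < window)

    # Dynamic column stripes: per head, keep the globally strongest key columns.
    col_mass = scores.mean(dim=2)                                # [b, h, s]
    stripe_idx = col_mass.topk(n_stripes, dim=-1).indices
    stripes = torch.zeros(b, h, s, dtype=torch.bool, device=q.device)
    stripes.scatter_(-1, stripe_idx, True)

    mask = (local[None, None] | stripes[:, :, None, :]) & causal[None, None]
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.einsum("bhqk,bhkd->bhqd", F.softmax(scores, dim=-1), v)


def compress_kv_chunks(k, v, attn_mass, chunk=32, keep_ratio=0.6):
    """Chunk-wise KV compression: drop whole KV chunks that received the least
    accumulated attention mass. attn_mass: [batch, heads, seq]."""
    b, h, s, d = k.shape
    n_chunks = s // chunk
    mass = attn_mass[..., : n_chunks * chunk].reshape(b, h, n_chunks, chunk).sum(-1)
    n_keep = max(1, int(n_chunks * keep_ratio))
    keep = mass.topk(n_keep, dim=-1).indices.sort(dim=-1).values  # [b, h, n_keep]

    k_chunks = k[:, :, : n_chunks * chunk].reshape(b, h, n_chunks, chunk, d)
    v_chunks = v[:, :, : n_chunks * chunk].reshape(b, h, n_chunks, chunk, d)
    gather_idx = keep[..., None, None].expand(-1, -1, -1, chunk, d)
    k_kept = k_chunks.gather(2, gather_idx).reshape(b, h, n_keep * chunk, d)
    v_kept = v_chunks.gather(2, gather_idx).reshape(b, h, n_keep * chunk, d)
    return k_kept, v_kept
```

In this sketch the stripe selection plays the role of the learned, context-dependent pattern (the paper instead predicts it with a fine-tuned compact encoder-decoder), and the `keep_ratio` stands in for the adaptive, layer-wise compression policy.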
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 14192