CacheFormer: High Attention-based Segment Caching

ACL ARR 2024 June Submission 3580 Authors

16 Jun 2024 (modified: 19 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Efficiently handling long contexts in transformer-based language models with low perplexity is an active area of research. Although numerous approaches have recently been presented, such as Linformer, Longformer, Performer, and structured state space models (SSMs), the problem remains unresolved. These models strive to reduce the quadratic time complexity of the attention mechanism to approximately linear time while minimizing the loss in quality caused by compressing the long context. Inspired by the cache and virtual memory concepts in computer architecture, we improve upon the Long-Short Transformer (Transformer-LS), which implements a sliding window for short attention and compressed contextual segments for long attention. Our enhancements augment the architecture with attention over dynamically retrieved uncompressed context segments that receive high attention at the compressed level. This mirrors the cache and virtual memory principle in computers: on a cache or page miss, not only the needed data is retrieved from random-access memory or disk, but nearby data is fetched as well. Analogously, when a segment receives high attention at the compressed level, we also retrieve its neighboring segments in uncompressed form. We further enhance the long-short transformer by augmenting the long attention with compressed overlapping segments, reducing the quality loss caused by segment fragmentation in sequences with long context. Our results show significant improvements in perplexity over the Transformer-LS baseline on popular benchmarks.
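To make the two mechanisms described in the abstract concrete, the following is a minimal sketch, not the authors' implementation: it illustrates (a) selecting the uncompressed segments whose compressed representations received the highest attention, together with their neighboring segments (the cache-style prefetch analogy), and (b) forming overlapping segments before compression. All function names, shapes, and parameters here (e.g. `top_k`, `prefetch`, `overlap`) are illustrative assumptions, not the paper's API.

```python
import torch


def retrieve_high_attention_segments(attn_scores, segments, top_k=2, prefetch=1):
    """Pick the segments with the highest compressed-level attention.

    attn_scores: (num_segments,) attention mass each compressed segment received
    segments:    (num_segments, seg_len, d_model) uncompressed segment embeddings
    Returns the selected uncompressed segments and their indices, including the
    `prefetch` segments that follow each hit (analogous to fetching the data
    adjacent to a cache or page miss).
    """
    num_segments = attn_scores.size(0)
    top_idx = torch.topk(attn_scores, k=min(top_k, num_segments)).indices

    selected = set()
    for i in top_idx.tolist():
        for j in range(i, min(i + prefetch + 1, num_segments)):
            selected.add(j)

    idx = sorted(selected)
    return segments[idx], idx


def overlapping_segments(x, seg_len, overlap):
    """Split a sequence into overlapping segments to reduce fragmentation.

    x: (seq_len, d_model); returns (num_segments, seg_len, d_model), where
    consecutive segments share `overlap` tokens.
    """
    step = seg_len - overlap
    return x.unfold(0, seg_len, step).transpose(1, 2)
```

In this sketch the overlapping segments would be compressed and attended to first; the attention scores at that compressed level then drive the uncompressed retrieval, with neighboring segments prefetched in the same spirit as a cache line fill.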
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: model architectures, sparse models, NLP in resource-constrained settings
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Approaches to low compute settings (efficiency)
Languages Studied: English
Submission Number: 3580