Abstract: Self-attention, the core computational kernel of transformer models, is computationally intensive and is therefore a major focus for accelerating large language model (LLM) inference. Existing methods for accelerating transformer inference often require modifying the transformer architecture or using specialized hardware accelerators, which limits their broad applicability. In this paper, we introduce AttnCache, a method that accelerates self-attention inference in the LLM prefill phase without these limitations. AttnCache is motivated by the observation that attention computations recur with rich similarity across different inference sequences. Building on a memorization database that leverages emerging big-memory systems, we propose embedding and efficient caching techniques that identify inputs producing similar attention maps, thereby reducing computation overhead. Experimental results show that AttnCache achieves a 1.2X speedup on average on Semantic Textual Similarity (STS) benchmarks, with only 2% performance loss.
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Efficient/Low-Resource Methods for NLP
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low compute settings-efficiency
Languages Studied: English
Submission Number: 1861
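
Below is a minimal sketch (in Python/NumPy) of the caching idea described in the abstract: reuse an attention map when the current input's embedding is sufficiently similar to one seen before, otherwise compute attention and store it. All names, the embedding key, the linear-scan lookup, and the similarity threshold are illustrative assumptions, not the authors' implementation.

import numpy as np
from typing import Optional

class AttnMapCache:
    """Stores (input embedding, attention map) pairs; hits are decided by cosine similarity."""
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold   # minimum cosine similarity to count as a cache hit (assumed value)
        self.keys = []               # input embeddings used as cache keys
        self.maps = []               # cached attention maps

    def _cosine(self, a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def lookup(self, embedding: np.ndarray) -> Optional[np.ndarray]:
        """Return the cached attention map of the most similar key if it clears the threshold, else None."""
        best_idx, best_sim = -1, -1.0
        for i, key in enumerate(self.keys):
            sim = self._cosine(embedding, key)
            if sim > best_sim:
                best_idx, best_sim = i, sim
        return self.maps[best_idx] if best_sim >= self.threshold else None

    def insert(self, embedding: np.ndarray, attn_map: np.ndarray) -> None:
        self.keys.append(embedding)
        self.maps.append(attn_map)

def attention_with_cache(q, k, v, embedding, cache: AttnMapCache):
    """Reuse a cached attention map for similar inputs; otherwise compute softmax(QK^T / sqrt(d)) and cache it."""
    attn = cache.lookup(embedding)
    if attn is None:
        d = q.shape[-1]
        scores = q @ k.T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
        cache.insert(embedding, attn)
    return attn @ v

A simple linear scan over keys is used here for clarity; a practical memorization database would use an approximate nearest-neighbor index over the embeddings.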