Keywords: efficient attention, compression, efficient architecture
Abstract: The transformer architecture is the default choice for large language models (LLMs), but its attention layers incur computational costs that scale quadratically with context length, which is prohibitive. To reduce these costs, many works propose alternative low-cost sequence mixers that approximate attention; for example, sparse or sliding-window attention limits the inputs to attention, while linear attention and convolutions limit the state size by removing or approximating the softmax transformation. These alternatives have limitations; e.g., to solve tasks like multi-query associative recall, sparse-attention transformers need to be deeper than vanilla transformers, and linear attention needs to be composed with self-attention. To build efficient LLMs without replacing the attention mechanism itself, we develop the Compress and Attend Transformer (CAT). CAT is a simple transformer-based architecture that decodes each token while attending only to compressed chunks of the sequence so far. The chunk size limits the compressor cost, and the compression reduces the decoder cost by a factor of the chunk size. As a result, CATs enjoy fast and memory-efficient generation, with up to $3\times$ higher generation throughput and $7\times$ lower memory usage compared to a dense transformer. We show that CATs match dense transformers on perplexity and common language modeling evaluations. At the same time, CATs outperform existing efficient attention alternatives on real-world recall benchmarks while matching their generation throughput and memory usage.
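To make the compress-and-attend idea concrete, below is a minimal, hypothetical sketch of the mechanism described in the abstract: fixed-size chunks of the prefix are compressed into single states, and each token attends only to those compressed states, shrinking the attention key/value length by a factor of the chunk size. The compressor here (mean-pool plus MLP), the chunk size, and the choice to let a token attend up to its own chunk are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn


class CompressAndAttendSketch(nn.Module):
    """Illustrative sketch (not the paper's implementation): compress chunks
    of the sequence into one state each, then attend over compressed states."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, chunk_size: int = 16):
        super().__init__()
        self.chunk_size = chunk_size
        # Assumed compressor: mean-pool each chunk, then project with a small MLP.
        self.compress = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        # Decoder attention: queries are token states, keys/values are compressed chunks.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); assume seq_len is a multiple of chunk_size.
        b, t, d = x.shape
        num_chunks = t // self.chunk_size
        chunks = x.view(b, num_chunks, self.chunk_size, d)
        # One compressed state per chunk: (b, num_chunks, d).
        compressed = self.compress(chunks.mean(dim=2))
        # Causal-style mask over chunks: a token may attend to its own chunk and
        # earlier ones, never to later chunks (True = blocked).
        token_chunk = torch.arange(t, device=x.device) // self.chunk_size
        chunk_idx = torch.arange(num_chunks, device=x.device)
        mask = chunk_idx[None, :] > token_chunk[:, None]  # (t, num_chunks)
        # Key/value length is t / chunk_size instead of t, reducing decoder cost.
        out, _ = self.attn(x, compressed, compressed, attn_mask=mask)
        return out


# Usage: a 512-token sequence attends to 512 / 16 = 32 compressed states.
model = CompressAndAttendSketch()
y = model(torch.randn(2, 512, 256))
print(y.shape)  # torch.Size([2, 512, 256])
```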
Submission Number: 131