Gated Slot Attention for Efficient Linear-Time Sequence Modeling

Published: 11 Sept 2024 · Last Modified: 16 Sept 2024 · OpenReview Archive Direct Upload · License: CC BY 4.0
Abstract: Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant resources for training from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory-Control (ABC [64]) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA [97]). Essentially, GSA comprises a two-layer GLA linked via softmax, utilizing context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining a compact recurrent state size. This design greatly enhances both training and inference efficiency through GLA's hardware-efficient training algorithm and reduced state size. Additionally, retaining the softmax operation is particularly beneficial in "finetuning pretrained Transformers to RNNs" (T2R [42]) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.
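To make the recurrent view described above concrete, the following is a minimal sketch of a single GSA decoding step, assuming the formulation implied by the abstract: two gated slot memories (keys and values) linked by a softmax read over the slots, with a per-slot forget gate. The function name gsa_recurrent_step, the slot count m, and the gate parameterization are illustrative assumptions rather than the paper's reference implementation; training would instead use GLA's chunkwise, hardware-efficient algorithm rather than this step-by-step recurrence.

    import torch
    import torch.nn.functional as F

    def gsa_recurrent_step(q, k, v, alpha, K_slots, V_slots):
        # Gated write: each of the m slots decays by its forget gate alpha
        # and admits the new key/value in proportion to (1 - alpha).
        K_slots = alpha[:, None] * K_slots + (1 - alpha)[:, None] * k[None, :]  # (m, d_k)
        V_slots = alpha[:, None] * V_slots + (1 - alpha)[:, None] * v[None, :]  # (m, d_v)
        # Softmax read: attend over the m slots (the "link via softmax"
        # between the two gated passes), then mix the slot values.
        scores = F.softmax(K_slots @ q, dim=0)  # (m,)
        o = V_slots.t() @ scores                # (d_v,)
        return o, K_slots, V_slots

    # Toy usage with hypothetical sizes: m = 16 slots, head dimension 64.
    m, d_k, d_v = 16, 64, 64
    K_slots = torch.zeros(m, d_k)
    V_slots = torch.zeros(m, d_v)
    q, k, v = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    alpha = torch.sigmoid(torch.randn(m))       # data-dependent forget gate in (0, 1)
    o, K_slots, V_slots = gsa_recurrent_step(q, k, v, alpha, K_slots, V_slots)

Under these assumptions, the recurrent state is just the pair (K_slots, V_slots) of shapes (m, d_k) and (m, d_v), which is what keeps decoding memory bounded and independent of sequence length.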