CAST: Clustering self-Attention using Surrogate Tokens for efficient transformers

Published: 01 Jan 2024 · Last Modified: 31 Oct 2024 · Pattern Recognition Letters 2024 · CC BY-SA 4.0
Abstract:

Highlights:
• The computation of self-attention in Transformers is limited by the input sequence length.
• We propose CAST, an efficient self-attention mechanism based on clustered attention.
• We propose the use of surrogate tokens to optimize self-attention in Transformers.
• We observe that cluster summaries enhance training efficiency and results.
• CAST reduces the complexity of the self-attention computation from O(N²) to O(αN).
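The highlights sketch the core idea: tokens are grouped into clusters via learned surrogate tokens, and full attention is computed only within each cluster, so the quadratic cost applies to cluster-sized blocks rather than the whole sequence. The PyTorch sketch below illustrates that idea; it is an assumption-laden illustration, not the paper's exact CAST algorithm. The argmax assignment rule, the single-head unprojected attention, and the function name `clustered_attention_sketch` are all invented here for clarity.

```python
import torch
import torch.nn.functional as F

def clustered_attention_sketch(x: torch.Tensor, surrogates: torch.Tensor) -> torch.Tensor:
    """Toy clustered self-attention guided by surrogate tokens.

    x:          (N, d) input token embeddings
    surrogates: (k, d) learned surrogate tokens, one per cluster

    Simplified illustration only: a real efficient-attention method would
    use learned Q/K/V projections, balanced cluster sizes, and batched
    computation rather than a Python loop over clusters.
    """
    n, d = x.shape
    k = surrogates.shape[0]

    # Assign each token to its most similar surrogate token.
    assignments = (x @ surrogates.T).argmax(dim=-1)   # (N,)

    out = torch.zeros_like(x)
    for c in range(k):
        idx = (assignments == c).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue  # empty cluster, nothing to attend over
        xc = x[idx]                                   # (n_c, d)
        # Full attention restricted to the cluster: O(n_c^2), not O(N^2).
        attn = F.softmax(xc @ xc.T / d ** 0.5, dim=-1)
        out[idx] = attn @ xc
    return out

# Example usage: 128 tokens of width 64, grouped by 8 surrogate tokens.
x = torch.randn(128, 64)
surrogates = torch.randn(8, 64)
y = clustered_attention_sketch(x, surrogates)
print(y.shape)  # torch.Size([128, 64])
```

If clusters stay near a constant size α, the per-cluster cost of α² applied to roughly N/α clusters gives O(αN) overall, which matches the complexity stated in the highlights.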