Abstract: Highlights
• The computation of self-attention in Transformers is limited by the input sequence length.
• We propose CAST, an efficient self-attention mechanism with clustered attention.
• We propose the use of surrogate tokens to optimize self-attention in Transformers.
• We observe that cluster summaries enhance training efficiency and results.
• CAST reduces the complexity of self-attention computation from O(N²) to O(αN).
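
To make the complexity claim concrete, the sketch below illustrates the general idea of cluster-summary attention: each of the N tokens attends to k cluster summaries built from learnable surrogate tokens, so the cost scales as O(N·k) rather than O(N²). This is an illustrative assumption about how such a mechanism can be wired up, not the paper's exact CAST algorithm; the function name `cluster_summary_attention` and all tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def cluster_summary_attention(x, surrogate, w_q, w_k, w_v):
    """Toy cluster-summary attention: every token attends to k cluster
    summaries instead of all N tokens, giving O(N*k) cost per layer
    rather than O(N^2). Illustrative sketch only, not the paper's CAST.

    x         : (N, d) input token embeddings
    surrogate : (k, d) learnable surrogate tokens (one per cluster)
    w_q, w_k, w_v : (d, d) projection matrices
    """
    q = x @ w_q                                   # (N, d) queries
    k = x @ w_k                                   # (N, d) keys
    v = x @ w_v                                   # (N, d) values
    d = x.shape[-1]

    # Soft cluster assignment: affinity of each token to each surrogate, O(N*k).
    assign = F.softmax(k @ surrogate.t() / d ** 0.5, dim=0)  # (N, k)

    # Cluster summaries: assignment-weighted averages of keys and values, O(N*k*d).
    k_sum = assign.t() @ k                        # (k, d) summary keys
    v_sum = assign.t() @ v                        # (k, d) summary values

    # Each token attends to only the k summaries, O(N*k*d) instead of O(N^2*d).
    attn = F.softmax(q @ k_sum.t() / d ** 0.5, dim=-1)  # (N, k)
    return attn @ v_sum                           # (N, d)

# Tiny usage example with hypothetical sizes.
N, d, k_clusters = 128, 64, 8
x = torch.randn(N, d)
surrogate = torch.randn(k_clusters, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = cluster_summary_attention(x, surrogate, w_q, w_k, w_v)
print(out.shape)  # torch.Size([128, 64])
```

With the number of surrogate tokens k held fixed (or grown much more slowly than N), the per-token work is proportional to k, which is what an O(αN) bound expresses.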