Highlights

• Computation of self-attention in Transformers is limited by the input sequence length.
• We propose CAST, an efficient self-attention mechanism with clustered attention.
• We propose the use of surrogate tokens to optimize self-attention in Transformers.
• We observe that cluster summaries enhance training efficiency and results.
• CAST reduces the complexity of self-attention computation from $O(N^2)$ to $O(\alpha N)$ (see the sketch below for the intuition).
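To make the last highlight concrete, the snippet below is a minimal PyTorch sketch of the general idea behind cluster-based attention: each token attends only to tokens in its own cluster plus one summary token per cluster, so the cost grows roughly with N times (cluster size + number of clusters) rather than N². This is an illustrative assumption, not the paper's CAST implementation; the contiguous clustering, mean-pooled summaries, and all names (`clustered_attention_sketch`, `w_qkv`, `n_clusters`) are hypothetical.

```python
import torch
import torch.nn.functional as F

def clustered_attention_sketch(x, n_clusters, w_qkv):
    """Toy cluster-based attention (NOT the paper's CAST algorithm).

    Tokens attend only within their cluster and to one mean-pooled summary
    token per cluster, so cost scales ~N * (cluster size + n_clusters)
    instead of N^2.
    """
    N, d = x.shape
    q, k, v = (x @ w_qkv).chunk(3, dim=-1)      # each of shape (N, d)

    # Naive contiguous clustering for illustration; CAST learns its grouping.
    ids = torch.arange(N) * n_clusters // N     # cluster id per token
    out = torch.empty_like(v)

    # Per-cluster summaries (mean-pooled keys/values standing in for the
    # paper's surrogate tokens / cluster summaries).
    k_sum = torch.stack([k[ids == c].mean(0) for c in range(n_clusters)])
    v_sum = torch.stack([v[ids == c].mean(0) for c in range(n_clusters)])

    for c in range(n_clusters):
        m = ids == c
        # Keys/values seen by this cluster: its own tokens + all summaries.
        k_c = torch.cat([k[m], k_sum], dim=0)
        v_c = torch.cat([v[m], v_sum], dim=0)
        att = F.softmax(q[m] @ k_c.T / d ** 0.5, dim=-1)
        out[m] = att @ v_c
    return out

# Example usage with made-up sizes: 512 tokens, model width 64, 8 clusters.
x = torch.randn(512, 64)
w_qkv = torch.randn(64, 3 * 64)
y = clustered_attention_sketch(x, n_clusters=8, w_qkv=w_qkv)
```

With a fixed cluster size, each token compares against a constant number of keys, which is the sense in which the quadratic term is replaced by a linear one in N.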