Abstract: Comparison between general Top-$K$ sparse attention and our Log-linear Sparse Attention (LLSA). In the example, the token sequence has length $N=8$, block size $B=2$, and Top-$K$ parameter $K=1$. To reduce the complexity of the selection stage from $O(N^2)$ to $O(N)$, we extend the single-level selection to $O(\log N)$ levels: we compute the Top-$K$ over the full sequence at the coarsest level and recursively compute a sparse Top-$K$ at each remaining level. To preserve global context in the attention, we enrich the key and value sets of each query with the $O(K \log N)$ coarse tokens found during the selection stage.
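To make the multi-level selection concrete, below is a minimal sketch of hierarchical Top-$K$ block selection for a single query, assuming a binary block hierarchy (each coarse block covers two blocks at the next finer level). The function name `hierarchical_topk_selection` and the `keys_per_level` layout are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def hierarchical_topk_selection(q, keys_per_level, K):
    """Sketch of multi-level Top-K selection for one query vector q.

    keys_per_level: list of key arrays, coarsest level first; level l has
    shape (num_blocks_l, d), and each block at level l covers two blocks
    at level l + 1 (assumed binary hierarchy).
    Returns the K selected finest-level block indices and the coarse
    (level, block) pairs gathered along the way, O(K log N) of them.
    """
    coarse_hits = []  # coarse blocks kept to preserve global context
    # Coarsest level: dense Top-K over the whole (coarse) sequence.
    scores = keys_per_level[0] @ q
    selected = np.argsort(scores)[::-1][:K]
    for level in range(1, len(keys_per_level)):
        coarse_hits.extend((level - 1, int(i)) for i in selected)
        # Sparse Top-K: only the children of the blocks selected at the
        # coarser level are scored at this level.
        candidates = np.concatenate([np.array([2 * i, 2 * i + 1]) for i in selected])
        candidates = candidates[candidates < len(keys_per_level[level])]
        scores = keys_per_level[level][candidates] @ q
        selected = candidates[np.argsort(scores)[::-1][:K]]
    return selected, coarse_hits
```

In this sketch, each level scores at most $2K$ candidate blocks, so the per-query selection cost grows with the number of levels, $O(\log N)$, rather than with the sequence length, and the collected `coarse_hits` would be appended to the selected fine blocks as the enriched key/value set.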