Keywords: systems, recurrent, architectures, kernels, cuda
TL;DR: Hardware-aware linear attention algorithm for large state sizes.
Abstract: Sequence models face stark tradeoffs between recall quality and memory efficiency. Recall -- the ability to use information over long sequences -- is critical for sequence modeling tasks ranging from information extraction to reasoning.
Prior work has shown that in theory, \textit{linear} attention models with sufficient recurrent state sizes can expand the Pareto frontier of the recall-memory tradeoff space beyond alternative architectures such as softmax attention and state space models.
However, it is difficult to scale the linear attention state size due to hardware bottlenecks. I/O-aware algorithms store the linear attention states in thread registers, but state sizes beyond $\approx 3$ megabytes exhaust register memory and trigger expensive register spills.
In this work, we introduce CYLON, a hardware-aware strategy for partitioning linear attention's recurrent state across the registers of multiple GPU processors and asynchronously combining the partitions. When applying CYLON to popular architectures, such as Hedgehog and Mamba-2, we unlock 3$\times$ higher throughput compared to prior linear attention algorithms for these architectures on both Hopper and Blackwell GPUs. Finally, CYLON makes large states available to model designers by enabling state sizes (e.g., $\geq$ 131 MB) that are not achievable with existing linear attention kernels.
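To illustrate the idea of partitioning a recurrent state and combining the partitions, the following is a minimal CPU-side sketch in plain Python. It is not the CYLON kernel. It assumes the unnormalized linear attention recurrence $S_t = S_{t-1} + k_t v_t^\top$, $o_t = S_t^\top q_t$, and it assumes the state is split along the key dimension so that partial outputs are combined by summation. The function names and the choice of split axis are hypothetical and are not taken from the paper.

```python
def linear_attention(qs, ks, vs):
    """Reference recurrence: S_t = S_{t-1} + k_t v_t^T, o_t = S_t^T q_t."""
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # recurrent state, d_k x d_v
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        # Rank-1 update of the state with the current key/value pair.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        # Read out the state with the current query.
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


def partitioned_linear_attention(qs, ks, vs, num_parts):
    """Split the state along the key dimension (an assumption of this sketch).

    Each partition owns a contiguous slice of state rows and could, in
    principle, live on a different processor. Partial outputs are summed.
    """
    d_k, d_v = len(ks[0]), len(vs[0])
    bounds = [
        (p * d_k // num_parts, (p + 1) * d_k // num_parts)
        for p in range(num_parts)
    ]
    partials = [
        linear_attention(
            [q[lo:hi] for q in qs], [k[lo:hi] for k in ks], vs
        )
        for lo, hi in bounds
    ]
    # Combine step: elementwise sum of the per-partition outputs.
    return [
        [sum(part[t][j] for part in partials) for j in range(d_v)]
        for t in range(len(qs))
    ]


if __name__ == "__main__":
    qs = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 1.0, 2.0]]
    ks = [[1.0, 0.0, 2.0, 1.0], [2.0, 1.0, 0.0, 1.0]]
    vs = [[3.0, 4.0], [1.0, 2.0]]
    print(linear_attention(qs, ks, vs))
    print(partitioned_linear_attention(qs, ks, vs, num_parts=2))
```

Because the readout is linear in the state, each partition's contribution can be computed independently and summed afterwards, so the two functions return the same outputs. A GPU implementation would additionally have to decide where each slice resides and how the partial results are exchanged, which this sketch does not model.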
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 24023