StreamAttention: Energy-Efficient and High-Utilization Attention on Systolic Hardware

Olav Førland; H. T. Kung

Published: 01 Jun 2026, Last Modified: 01 Jun 2026

Venue: AdaptFM Poster

License: CC BY 4.0

Keywords: systolic array, hardware accelerator, attention, transformer inference, FlashAttention, energy-efficient AI, softmax, Chebyshev approximation, hardware-software co-design, low-power machine learning

TL;DR: StreamAttention runs the full attention layer on one systolic array, reaching 95–98% utilization and 2.9× lower power than a weight-stationary baseline.

Abstract: Attention is the dominant cost in modern transformers, growing quadratically with sequence length. FlashAttention [Dao et al., 2022] cuts the memory cost through tiling and online softmax computation, but the underlying hardware remains the bottleneck. Modern accelerators are optimized for matrix multiplication, while the non-linear softmax operation is typically offloaded to much lower-throughput vector units, leading to pipeline stalls. We present StreamAttention, an accelerator co-designed with FlashAttention-2 [Dao, 2023] that sustains continuous streaming of operands on a single systolic array of multiply-accumulate (MAC) units. We map online attention to MAC recurrences that fit the systolic dataflow exactly, and evaluate the softmax exponential as a Chebyshev polynomial on the same MAC units. A four-stage pipeline overlaps every attention phase with no idle cycles between tiles, so operands stream through the array without interruption. StreamAttention achieves 95–98% utilization, against $\sim$40% for the closest peer, SystolicAttention [Lin et al., 2025]. We demonstrate that the Chebyshev approximation has negligible impact on Llama-3.2-1B (WikiText-103 perplexity), ViT-Base (ImageNet top-1), and BERT-Large (SQuAD F1). Furthermore, fusing softmax onto the array eliminates per-tile SRAM round-trips of the score and softmax matrices, cutting per-layer attention energy by up to $\sim$2.8$\times$ and average attention power by up to $\sim$2.9$\times$ compared to our weight-stationary baseline. This comes at the cost of a $\sim$34% area overhead and $\sim$6% higher power on pure matrix multiplication relative to a standard systolic-based matrix unit. Area and power come from synthesis against the open-source SkyWater 130 nm process design kit; SRAM energies come from CACTI 7 at 90 nm.
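
As a rough illustration of the two ideas the abstract combines, the sketch below runs tiled online-softmax attention for a single query and replaces every exponential with a Chebyshev polynomial evaluated by the Clenshaw recurrence, which needs only multiply-adds. It is a floating-point Python sketch, not a model of the StreamAttention dataflow or pipeline. The abstract does not state the polynomial degree, the approximation interval, or how out-of-range inputs are handled, so `LO`, `DEGREE`, the clamping step, and all function names here are assumptions chosen for illustration.

```python
import math
import random

LO = 8.0     # assumed: approximate exp on [-LO, 0]; smaller inputs are clamped
DEGREE = 10  # assumed polynomial degree (not taken from the paper)


def cheb_coeffs(f, a, b, degree):
    """Chebyshev interpolation coefficients of f on [a, b] (computed offline)."""
    n = degree + 1
    nodes = [math.cos(math.pi * (j + 0.5) / n) for j in range(n)]
    vals = [f(0.5 * (b - a) * t + 0.5 * (b + a)) for t in nodes]
    c = [2.0 / n * sum(v * math.cos(math.pi * k * (j + 0.5) / n)
                       for j, v in enumerate(vals)) for k in range(n)]
    c[0] *= 0.5
    return c


COEFFS = cheb_coeffs(math.exp, -LO, 0.0, DEGREE)


def exp_cheb(x):
    """Approximate exp(x) for x <= 0 with multiply-adds only (Clenshaw)."""
    x = max(x, -LO)             # clamp; also maps -inf to -LO
    t = (2.0 * x + LO) / LO     # affine map from [-LO, 0] to [-1, 1]
    b1 = b2 = 0.0
    for c in reversed(COEFFS[1:]):
        b1, b2 = c + 2.0 * t * b1 - b2, b1
    return COEFFS[0] + t * b1 - b2


def online_attention(q, keys, values, tile=4, exp_fn=exp_cheb):
    """Single-query attention over key/value tiles with a running max and sum."""
    scale = 1.0 / math.sqrt(len(q))
    m, l, o = -math.inf, 0.0, [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        s = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        m_new = max(m, max(s))
        alpha = exp_fn(m - m_new)       # rescale factor for earlier tiles
        l *= alpha
        o = [alpha * oi for oi in o]
        for sj, v in zip(s, v_tile):
            p = exp_fn(sj - m_new)      # argument is always <= 0
            l += p
            o = [oi + p * vi for oi, vi in zip(o, v)]
        m = m_new
    return [oi / l for oi in o]


def reference_attention(q, keys, values):
    """Plain softmax attention with math.exp, used only for comparison."""
    scale = 1.0 / math.sqrt(len(q))
    s = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    mx = max(s)
    p = [math.exp(x - mx) for x in s]
    z = sum(p)
    return [sum(pj * v[i] for pj, v in zip(p, values)) / z
            for i in range(len(values[0]))]


if __name__ == "__main__":
    rng = random.Random(0)
    q = [rng.uniform(-1, 1) for _ in range(8)]
    keys = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(16)]
    values = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(16)]
    approx = online_attention(q, keys, values)
    exact = reference_attention(q, keys, values)
    print(max(abs(a - b) for a, b in zip(approx, exact)))
```

Because the running maximum is subtracted before each exponential, every argument is non-positive, which is why a single fixed interval can suffice in this sketch. The clamp at `-LO` is a simplification: scores more than `LO` below the running maximum contribute `exp(-LO)` instead of a smaller value, so a real design would need its own policy for that range.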

Submission Number: 191
