FuseMax: Leveraging Extended Einsums to Optimize Attention Accelerator Design

Published: 30 May 2024, Last Modified: 08 Jun 2024, MLArchSys 2024 Oral, CC BY 4.0
Workshop Track: System for Machine Learning
Presentation: In-Person
Keywords: Transformers, Attention, Spatial Architectures, Einsums
Presenter Full Name: Nandeeka Nayak
TL;DR: This work uses the cascade of Einsums abstraction to analyze the space of attention implementations and schedules an efficient variant onto a spatial architecture, resulting in an average 6.7x speedup on attention over the prior state-of-the-art.
Presenter Email: nandeeka@berkeley.edu
Abstract: Attention for transformers is a critical workload that has recently received significant "attention" as a target for custom acceleration. Yet, while prior work succeeds in reducing attention's memory-bandwidth requirements, it creates load imbalance between attention operators (resulting in severe compute under-utilization) and requires on-chip memory that scales with sequence length (which is expected to grow over time). This paper ameliorates these issues, enabling attention with nearly 100\% compute utilization, no off-chip memory traffic bottlenecks, and on-chip buffer size requirements that are independent of sequence length. The main conceptual contribution is to use a recently proposed abstraction---the cascade of Einsums---to describe, formalize, and taxonomize the space of attention algorithms that appear in the literature. In particular, we show how Einsum cascades can be used to infer non-trivial lower bounds on the number of passes a kernel must take through its input data, which has implications for either required on-chip buffer capacity or memory traffic. We show how this notion can be used to meaningfully divide the space of attention algorithms into several categories and use these categories to inform our design process. Based on the above characterization, we propose FuseMax---a novel mapping of attention onto a spatial array-style architecture. On attention, in an iso-area comparison, FuseMax achieves an average $6.7\times$ speedup over the prior state-of-the-art, FLAT, while using 80\% of the energy. Similarly, on full end-to-end transformer inference, FuseMax achieves an average $5.3\times$ speedup over FLAT while using 85\% of the energy.
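To make the "cascade of Einsums" framing concrete, below is a minimal NumPy sketch (not the paper's artifact) of numerically stable attention written as such a cascade. The rank labels (p for query positions, e for key positions, m for the embedding) and the unfused step structure are illustrative assumptions, with np.einsum standing in for the paper's extended Einsum notation, whose map and reduce operators generalize beyond multiply/add.

```python
import numpy as np

P = E = 8   # query/key sequence lengths (illustrative rank sizes)
M = 4       # embedding dimension

rng = np.random.default_rng(0)
Q = rng.standard_normal((P, M))  # queries
K = rng.standard_normal((E, M))  # keys
V = rng.standard_normal((E, M))  # values

# The cascade: each step is one (extended) Einsum over named ranks.
QK = np.einsum("pm,em->pe", Q, K)    # scores: contract the m rank
RM = QK.max(axis=1, keepdims=True)   # max-reduction over the e rank
SN = np.exp(QK - RM)                 # elementwise stabilized exponentials
SD = SN.sum(axis=1, keepdims=True)   # sum-reduction over the e rank
A  = SN / SD                         # elementwise softmax normalization
AV = np.einsum("pe,em->pm", A, V)    # weighted sum of values over e
```

Even when these steps are fused so intermediates are never materialized, the max- and sum-reductions force the kernel to sweep the e rank multiple times before any output element is final; counting the minimum number of such sweeps is, roughly, the pass lower bound the abstract describes, and reducing the pass count is the kind of opportunity a fused mapping like FuseMax exploits.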
Presenter Bio: Nandeeka Nayak is a rising fifth-year Computer Science PhD student at the University of California, Berkeley, advised by Chris Fletcher. She works on understanding efficient implementations of domain-specific kernels, with a focus on building abstractions that unify a wide variety of kernels and accelerator designs into a small set of primitives.
Paper Checklist Guidelines: I certify that all co-authors have validated the presented results and conclusions, and have read and commit to adhering to the Paper Checklist Guidelines, Call for Papers and Publication Ethics.
Dataset Release: I certify that all co-authors commit to release the dataset and necessary scripts to reproduce the presented results.
Workshop Registration: Yes, at least one of the authors has registered for the workshop (Two-Day Registration at minimum).
Submission Number: 5