Full Stack Optimization of Transformer Inference

Published: 16 May 2023, Last Modified: 15 Jun 2023 · ASSYST Oral
Keywords: Full stack optimization, hardware-software co-design, Transformer accelerator, scheduling, hardware-aware neural architecture search
TL;DR: We adopt a full-stack approach to optimize Transformer inference, which involves analyzing the hardware and scheduling implications of the architecture, as well as utilizing neural architecture search to adapt the network to the underlying hardware.
Abstract: Recent advances in state-of-the-art neural network architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of applications in computer vision, natural language processing, and speech recognition. This trend has been consistent over the several years since Transformer models were originally introduced. However, the amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate, which has made their deployment in latency-sensitive applications challenging. As such, there has been an increased focus on making Transformer models more efficient, with methods ranging from changes to the architecture design all the way to dedicated domain-specific accelerators. In this work, we pursue a full-stack approach to optimizing Transformer inference. We analyze the implications of the Transformer architecture on hardware, including the impact of nonlinear operations such as Layer Normalization, Softmax, and GELU, as well as linear operations, and we use this analysis to optimize a fixed Transformer architecture. We assess the challenges of finding the right mapping and scheduling of operations for Transformer models, and pursue neural architecture search to further optimize the Transformer network. We find that a full-stack co-design approach with the aforementioned methods can result in up to 88.7x end-to-end speedup with minimal performance degradation for Transformer inference. More details can be found in our full paper, which includes (1) a comprehensive analysis of Transformer workloads, (2) an extensive survey of current hardware and software solutions for efficient Transformer inference, and (3) case studies quantifying the advantages of co-design and co-optimization techniques across the stack for end-to-end Transformer inference.
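
To make the kind of workload analysis described above concrete, here is a minimal back-of-the-envelope sketch (not taken from the paper; the model dimensions, the multiply-accumulate counting convention, and the helper name layer_op_counts are illustrative assumptions) that contrasts the matmul FLOPs of a single Transformer encoder block with the element counts processed by its nonlinear operations (Softmax, LayerNorm, GELU):

```python
# Illustrative sketch: per-layer operation counts for one Transformer encoder block.
# Dimensions default to a BERT-base-like configuration; counts are approximate.

def layer_op_counts(seq_len=512, d_model=768, d_ff=3072, n_heads=12):
    d_head = d_model // n_heads

    # Linear operations (matmul FLOPs, counting a multiply-accumulate as 2 ops)
    qkv_proj   = 3 * 2 * seq_len * d_model * d_model       # Q, K, V projections
    attn_score = 2 * n_heads * seq_len * seq_len * d_head   # Q @ K^T
    attn_value = 2 * n_heads * seq_len * seq_len * d_head   # scores @ V
    out_proj   = 2 * seq_len * d_model * d_model             # attention output projection
    ffn        = 2 * 2 * seq_len * d_model * d_ff            # two feed-forward matmuls
    matmul_flops = qkv_proj + attn_score + attn_value + out_proj + ffn

    # Nonlinear operations (element counts; each element needs exp/div/sqrt etc.)
    softmax_elems   = n_heads * seq_len * seq_len             # attention probabilities
    layernorm_elems = 2 * seq_len * d_model                   # two LayerNorms per block
    gelu_elems      = seq_len * d_ff                          # FFN activation
    return matmul_flops, softmax_elems, layernorm_elems, gelu_elems

if __name__ == "__main__":
    flops, sm, ln, gelu = layer_op_counts()
    print(f"matmul GFLOPs per layer: {flops / 1e9:.1f}")
    print(f"softmax elements:        {sm / 1e6:.1f} M")
    print(f"LayerNorm elements:      {ln / 1e6:.2f} M")
    print(f"GELU elements:           {gelu / 1e6:.2f} M")
```

Even though the nonlinear operations contribute a small fraction of the total operation count, each of their elements requires exponentials, divisions, or square roots rather than fused multiply-adds, which is why the paper analyzes their hardware cost alongside the matmuls.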
Workshop Track: ASSYST
Presentation: In-Person
Presenter Full Name: Coleman Hooper
Presenter Email: chooper@berkeley.edu
Presenter Bio: I am a graduate student at UC Berkeley, CA, USA, pursuing a PhD in Electrical Engineering, affiliated with the Specialized Computing Ecosystems (SLICE) lab and with Berkeley AI Research (BAIR). My research interests are in hardware acceleration and hardware-software co-design for machine learning with a particular focus on NLP applications. Previously, I received a B.S. degree in Electrical Engineering from Harvard University, MA, USA.