Towards A Reconfigurable Systolic Array with Multi-Level Packing for Transformers

Published: 16 May 2023, Last Modified: 15 Jun 2023
ASSYST Oral
Readers: Everyone
Abstract: Transformer-based models have achieved remarkable success across a wide range of natural language processing tasks. When handling the variable-length sentences of human language, prior works suffer from low hardware efficiency, caused either by the shape mismatch between fixed-shape processing elements (PEs) and variable-shape workloads under data parallelism, or by large pipeline bubbles under pipeline parallelism. This ongoing work proposes a hybrid scheme that applies data parallelism to linear operators and pipeline parallelism to attention. We develop a reconfigurable systolic array with multi-level packing to improve hardware efficiency. First, linear operators for different inputs can be packed along the array columns to improve spatial efficiency. Meanwhile, to boost temporal efficiency, we develop a head-level pipeline for attention, with different stages packed onto the array. We further skip the redundant computation in masked attention by packing the computation of two heads along time. Packing decisions are explored with a dynamic-programming-based algorithm to maximize the overall throughput. Applied to GPT, our FPGA design achieves $1.16\times$ higher normalized throughput and $1.94\times$ better runtime MAC utilization than state-of-the-art GPU performance on variable-length sequences from the MRPC, RTE and SQuADv2 datasets.
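To illustrate the column-packing decision mentioned in the abstract, the following minimal sketch shows a dynamic-programming search of the kind described; it is not the paper's actual algorithm. It assumes a simplified model in which each linear operator occupies a fixed number of array columns for a fixed number of cycles, operators packed into the same group run concurrently (group latency is the maximum of its members' latencies), groups execute back to back, and only operators adjacent in input order may be packed together. All names and parameters here are hypothetical.

```python
# A minimal sketch (not the authors' algorithm) of a dynamic-programming
# search over column-packing decisions for a systolic array.
from functools import lru_cache

ARRAY_COLS = 32  # hypothetical systolic-array width (number of PE columns)

def pack_latency(cols, time, array_cols=ARRAY_COLS):
    """Minimal total latency when packing contiguous operators along the columns.

    cols[i]: number of array columns operator i occupies.
    time[i]: number of cycles operator i needs once mapped.
    """
    n = len(cols)

    @lru_cache(maxsize=None)
    def dp(i):
        # dp(i): best total latency for operators i..n-1.
        if i == n:
            return 0
        best = float("inf")
        used_cols = 0
        group_latency = 0
        # Try packing operators i..j into one group on the array.
        for j in range(i, n):
            used_cols += cols[j]
            if used_cols > array_cols:
                break  # group no longer fits along the array columns
            group_latency = max(group_latency, time[j])
            best = min(best, group_latency + dp(j + 1))
        return best

    return dp(0)

# Example: four linear operators with different shapes.
print(pack_latency(cols=[16, 16, 24, 8], time=[100, 90, 120, 40]))  # -> 220
```

In this toy model, packing the first two operators together and the last two together hides the shorter operators behind the longer ones, which is the spatial-efficiency effect the column-level packing targets.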
Workshop Track: ASSYST
Presentation: In-Person
Presenter Full Name: Tiandong Zhao
Presenter Email: zhaotiandong@ucla.edu
Presenter Bio: Tiandong Zhao is a Ph.D. student in the Electrical and Computer Engineering department at University of California, Los Angeles (UCLA), advised by Lei He. His research interests include compiler and computer architecture for deep learning applications.