SA4: A Comprehensive Analysis and Optimization of Systolic Array Architecture for 4-bit Convolutions

Geng Yang; Jie Lei; Zhenman Fang; Jiaqing Zhang; Junrong Zhang; Weiying Xie; Yunsong Li

Published: 01 Jan 2024, Last Modified: 13 Nov 2024, FPL 2024, CC BY-SA 4.0

Abstract: Many studies have demonstrated that 4-bit quantization can maintain accuracy comparable to that of floating-point deep neural networks (DNNs). This has sparked keen interest in the efficient acceleration of such compressed DNNs, especially 4-bit convolutions, on edge devices. However, we observe that conventional systolic array (SA) architectures, widely adopted for DNN acceleration, fail to fully exploit the high computational density of 4-bit DSP packing. In this paper, we conduct the first comprehensive analysis of the integration of modern DSP packing techniques (specifically, 4-bit fully DSP packing) into 4-bit systolic array design for convolutions. First, we introduce a row-temporal weight-stationary 4-bit SA dataflow that complements the loop execution order inherent in 4-bit fully DSP packing in conventional SAs; we call the resulting design BaseSA. Next, we analyze the performance and resource efficiency of BaseSA and identify two inefficiencies in the integration: 1) excessive LUT utilization that constrains the overall SA size, and 2) a large latency gap relative to the theoretical optimum, caused by various stalls in data supply. To overcome these obstacles, we propose SA4: an HLS-based, customizable, and ultra-efficient hierarchical SA architecture optimized for 4-bit convolutions. The core unit of SA4 is a carefully designed, cost-effective SA unit (SAU), which 1) replaces the costly buffer-based data suppliers for activations and weights with shift-register-based ones, 2) replaces the LUT-intensive FIFO connections between SA processing elements (PEs) with registers, and 3) replaces the finite state machine (FSM) and data unpacking logic inside each PE with a global FSM inside each SAU and a data splitter shared by a column of PEs. Because this design limits a single SAU to a small spatial size, we scale out using an array of SAUs.
Experimental results show that SA4 achieves 1153.2 GOPS on the AMD-Xilinx Ultra96-V2 FPGA, with a $13.8\times$ increase in GOPS/DSP efficiency and a $49\times$ increase in GOPS/kLUTs efficiency compared to a straightforward integration of an SA with 4-bit DSP packing. The SA4 project is open-sourced at https://github.com/Michaela1224/SA4.
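
The abstract relies on DSP packing without showing the arithmetic behind it. The general idea can be sketched in Python as follows: several narrow operands are placed in disjoint bit fields of two wide operands so that a single wide multiplication yields several independent products. This is a minimal illustration for unsigned 4-bit values only. The field offsets (11 and 22 bits) and the function name `pack_and_multiply` are assumptions chosen for readability, not the layout used in the paper or in any vendor documentation.

```python
# Illustrative field offsets (assumptions, not taken from the paper).
A_SHIFT = 11  # bit offset of the second activation in the packed operand
W_SHIFT = 22  # bit offset of the second weight in the packed operand
FIELD_MASK = (1 << A_SHIFT) - 1  # each partial product sits in an 11-bit field


def pack_and_multiply(a0, a1, w0, w1):
    """Compute four unsigned 4-bit products with one wide multiplication.

    Returns (a0*w0, a1*w0, a0*w1, a1*w1).
    """
    for value in (a0, a1, w0, w1):
        if not 0 <= value <= 15:
            raise ValueError("operands must be unsigned 4-bit values")

    packed_a = a0 | (a1 << A_SHIFT)  # 15-bit operand
    packed_w = w0 | (w1 << W_SHIFT)  # 26-bit operand

    # One multiplication produces
    #   a0*w0 + (a1*w0 << 11) + (a0*w1 << 22) + (a1*w1 << 33).
    # Each product is at most 15 * 15 = 225 < 2**11, so fields never overlap.
    wide = packed_a * packed_w

    return tuple((wide >> shift) & FIELD_MASK for shift in (0, 11, 22, 33))


if __name__ == "__main__":
    print(pack_and_multiply(3, 5, 7, 9))
```

Signed operands need an additional correction step, because a negative low field borrows from the field above it, and the spare bits in each field bound how many packed products can be accumulated before neighboring fields interfere. Both are omitted here to keep the sketch short.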
