Exploring the Performance Improvement of Tensor Processing Engines through Transformation in the Bit-weight Dimension of MACs
Abstract: General matrix-matrix multiplication (GEMM) is a cornerstone of AI computation, which has made tensor processing engines (TPEs) increasingly critical components of existing GPUs and domain-specific architectures (DSAs). Our analysis finds that prevailing architectures focus primarily on dataflow or operand-reuse strategies; considering matrix multiplication together with the multiply-accumulator (MAC) itself opens a larger optimization space for TPE design. This work introduces a novel hardware perspective on matrix multiplication that focuses on the bit-weight dimension of MACs. Through this lens, we propose a finer-grained TPE notation, using the matrix triple loop as an example, which introduces new methods and ideas for designing and optimizing PE microarchitectures. Based on the new notation and transformations, we propose four optimization techniques that achieve varying degrees of improvement in timing, area, and power consumption. We implement our design in RTL using the SMIC 28nm process. Applying our methods to four classic TPE architectures (systolic array [20], 3D-Cube [27], multiplier-adder tree [48], and 2D-Matrix [30]), we achieve area efficiency improvements of $1.27 \times$, $1.28 \times$, $1.56 \times$, and $1.44 \times$, and energy efficiency improvements of $1.04 \times$, $1.56 \times$, $1.49 \times$, and $1.20 \times$, respectively. When applied to a bit-slice architecture, our methods achieve a $12.10 \times$ improvement in energy efficiency and a $2.85 \times$ improvement in area efficiency compared to Laconic [38]. Our Verilog HDL code, along with timing, area, and power reports from circuit synthesis, is available at https://github.com/wqzustc/High-Performance-Tensor-Processing-Engines.
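To make the idea of a bit-weight dimension concrete, the following is a minimal Python sketch, not the authors' notation or RTL. It assumes unsigned entries in B of a fixed width (`bits`, a hypothetical parameter), and the function names `gemm_reference` and `gemm_bit_weight` are illustrative only. It shows one way the matrix triple loop can gain a fourth loop over bit weights, so that partial products of equal weight are reduced over k first and shifted once.

```python
def gemm_reference(A, B):
    """Plain triple-loop GEMM: C[m][n] = sum_k A[m][k] * B[k][n]."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            for k in range(K):
                C[m][n] += A[m][k] * B[k][n]
    return C


def gemm_bit_weight(A, B, bits=8):
    """Same GEMM with an explicit loop over the bit weights of B.

    Assumption: every B[k][n] is an unsigned integer below 2**bits.
    For each bit weight w, the k loop sums the A operands selected by
    bit w of B, and the sum is shifted by w only once.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            for w in range(bits):            # bit-weight dimension
                partial = 0
                for k in range(K):           # reduce equal-weight terms
                    if (B[k][n] >> w) & 1:
                        partial += A[m][k]
                C[m][n] += partial << w      # one shift per bit weight
    return C


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(gemm_bit_weight(A, B) == gemm_reference(A, B))
```

Because the two functions compute the same result, the position of the `w` loop relative to `m`, `n`, and `k` is a free design choice in this sketch; which orderings are profitable in hardware is the subject of the paper itself.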
External IDs:dblp:conf/hpca/WuLG0HTWZZYW025