A 28nm Scalable and Flexible Accelerator for Sparse Transformer Models

Published: 01 Jan 2024 · Last Modified: 11 Nov 2024 · ISLPED 2024 · CC BY-SA 4.0
Abstract: Transformer-based models have been widely adopted in deep learning. Accuracy-driven applications keep expanding model sizes, yet current hardware accelerator designs fail to adapt their scalability to the computation intensity of different model sizes. Meanwhile, supporting transformer models of different sizes requires flexibility for matrix multiplications of various dimensions. At a higher level, the complex computation flow within transformer models demands flexible data management in the accelerator. Furthermore, the massive model size opens the possibility of exploiting sparsity to eliminate redundancy in the model; however, exploiting fine-grained sparsity in hardware remains challenging and under-explored for transformer accelerators. Finally, the non-linear functions and modules of transformer models require dedicated hardware to balance the trade-off between accuracy and hardware cost. Motivated by these challenges, we propose a novel hardware accelerator for transformer-based models. In particular, we propose row-wise matrix multiplication processing elements (RMMPE) and post-PE processors (PPE). The RMMPE computes matrix multiplication as row-wise products with high data reuse and efficiently handles unstructured sparse matrix multiplication across various dimensions, improving scalability and flexibility for different transformer models. The PPE computes complex non-linear functions via linear approximation. The proposed accelerator achieves 17.1 TOPS peak throughput and 19.5 TOPS/W peak energy efficiency, outperforming recent state-of-the-art transformer accelerators.
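At the algorithmic level, the row-wise product dataflow described for the RMMPE corresponds to accumulating scaled rows of the second operand for each nonzero of the first, which is why unstructured sparsity can be skipped without any structured pruning pattern. The sketch below is a software illustration of that dataflow only, not the paper's hardware implementation; the function name, matrix sizes, and the ~70% sparsity level are assumptions chosen for illustration.

```python
# A minimal software sketch of row-wise sparse matrix multiplication,
# illustrating the dataflow the abstract attributes to the RMMPE units.
# Tiling, PE-array dimensions, and on-chip buffering are not modeled here.

import numpy as np

def rowwise_sparse_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Compute C = A @ B by accumulating scaled rows of B.

    For each nonzero A[i, k], the whole row B[k, :] is scaled by A[i, k]
    and accumulated into C[i, :]. Zeros in A are skipped entirely, so the
    work scales with the number of nonzeros rather than the dense shape,
    and each fetched row of B is reused across all N output columns.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"

    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for i in range(M):
        # Visit only the nonzero entries of row i of A (unstructured sparsity).
        for k in np.nonzero(A[i])[0]:
            C[i, :] += A[i, k] * B[k, :]
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 16))
    A[rng.random(A.shape) < 0.7] = 0.0   # ~70% unstructured sparsity (illustrative)
    B = rng.standard_normal((16, 4))
    assert np.allclose(rowwise_sparse_matmul(A, B), A @ B)
```

The inner loop never depends on where the zeros fall, which is what makes the scheme compatible with unstructured sparsity and with matrices of arbitrary dimensions; the PPE's linear approximation of non-linear functions (e.g., softmax or GELU) is a separate, complementary mechanism not shown in this sketch.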