Sibia: Signed Bit-slice Architecture for Dense DNN Acceleration with Slice-level Sparsity Exploitation

Dongseok Im, Gwangtae Park, Zhiyong Li, Junha Ryu, Hoi-Jun Yoo

Published: 01 Jan 2023, Last Modified: 06 Nov 2023 | HPCA 2023 | Readers: Everyone

Abstract: Deep neural networks (DNNs) have achieved high performance in many AI fields, such as 1-D language, 2-D image, and 3-D point cloud processing. Because recent DNN tasks require dense matrix operations with various bit precisions and non-ReLU activation functions, mobile neural processing units (NPUs) struggle to accelerate diverse DNN tasks within their limited hardware resources and power budget. Although bit-slice architectures benefit from slice-level computation and slice-level sparsity exploitation, the conventional bit-slice representation is inefficient in such architectures, resulting in poor dense DNN execution. This paper proposes an efficient signed bit-slice architecture, Sibia, which uses the signed bit-slice representation (SBR) for efficient dense DNN acceleration. The SBR adds a sign bit to each bit-slice and changes a signed 1111₂ bit-slice to 0000₂ by borrowing a value of 1 from its lower-order bit-slice. This scheme generates a large number of zero bit-slices in dense DNNs without relying on accuracy-sensitive pruning methods or retraining. Moreover, the SBR balances the positive and negative values of 2's complement data, allowing accurate bit-slice-based output speculation that pre-computes the high-order bit-slices. Sibia integrates signed multiply-and-accumulate (MAC) units for efficient signed bit-slice computation, and its flexible zero-skipping processing element (PE) supports both zero input bit-slice skipping and output skipping for high throughput and energy efficiency. Additionally, a dynamic sparsity monitoring unit tracks the sparsity ratios of the input and weight data and selects the sparser of the two for zero bit-slice skipping. The heterogeneous network-on-chip (NoC) exploits data reusability during bit-slice computation, reducing transmission bandwidth. Finally, Sibia outperforms the previous bit-slice architecture, Bit-fusion, achieving 3.65× higher area efficiency, 3.88× higher energy efficiency, and 5.35× higher throughput.
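
The borrowing step described in the abstract can be illustrated with a small sketch. It is one plausible reading of the SBR, not the paper's implementation: the function names, the slice width `k` (magnitude bits per slice), and the slice count `n` are assumptions introduced here, and the encoding used in Sibia may differ in detail. The sketch compares a conventional 2's complement split (signed top slice, unsigned lower slices) with a split in which every slice carries the sign of the value, so that a small negative number gets a zero high-order slice.

```python
def conventional_slices(x, k, n):
    """Plain 2's complement split: lower slices unsigned, top slice signed.
    Returns slices low-order first. (Hypothetical helper for illustration.)"""
    mask = (1 << k) - 1
    slices = []
    for _ in range(n - 1):
        slices.append(x & mask)  # unsigned k-bit slice
        x >>= k                  # arithmetic shift (floors toward -infinity)
    slices.append(x)             # signed top slice
    return slices


def signed_slices(x, k, n):
    """Assumed SBR-like split: every slice is signed and shares the sign of x.
    For negative x this equals adding 1 to the higher slice and subtracting
    2**k from a nonzero lower slice (the 'borrow')."""
    radix = 1 << k
    sign = -1 if x < 0 else 1
    mag = abs(x)
    slices = []
    for _ in range(n - 1):
        slices.append(sign * (mag % radix))
        mag //= radix
    slices.append(sign * mag)    # top slice takes the remaining magnitude
    return slices


def reconstruct(slices, k):
    """Recombine low-order-first slices into the original integer."""
    return sum(d << (k * i) for i, d in enumerate(slices))


if __name__ == "__main__":
    k, n = 3, 2  # assumed: 7-bit data as two slices with 3 magnitude bits each
    for x in (-1, -5, 5):
        print(x, conventional_slices(x, k, n), signed_slices(x, k, n))
```

Under these assumptions, `x = -1` splits conventionally into `[7, -1]` (the high slice is 1111₂ as a 4-bit signed value), while the signed split gives `[-1, 0]`, turning the high-order slice into a zero slice that could be skipped. Because the signed split truncates toward zero instead of toward negative infinity, dropping the low-order slices also leaves an error that is symmetric around zero, which is consistent with the abstract's remark about balancing positive and negative values.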
