Algorithm and hardware co-design co-optimization framework for LSTM accelerator using quantized fully decomposed tensor train
Abstract: Researchers in both industry and academia are interested in developing Deep Neural Networks (DNNs) due to their remarkable performance in applications such as image recognition, complex game playing, and large-scale information retrieval (e.g., web search). However, the high computational and power demands that DNN models place on resource-constrained electronic devices have drawn increasing attention. Optimizing DNN models, for instance through model compression, is crucial for deploying DNNs widely and bringing them to the most resource-constrained scenarios. Among the many techniques, tensor train (TT) decomposition is considered especially promising. Although our previous efforts (1) pushed the limits on the number of multiplications by eliminating all redundant computations and (2) decomposed the computation into multistage processing to reduce memory traffic, the full potential of that work remained unexplored. In this paper, we investigate and demystify TT decomposition from a hardware-aware perspective and develop an efficient hardware optimization methodology within a novel hardware solution. Three key contributions are achieved: (1) a novel approach that applies TT decomposition to the entire LSTM model; (2) a more efficient quantization method proposed at the hardware optimization level; and (3) an efficient hardware accelerator designed through algorithm and hardware co-design. Based on these novelties, the proposed work achieves 1.69× power reduction and 2.28× higher power efficiency (GOPS/W) across different workloads.
In addition, compared to the state-of-the-art C-LSTM, it achieves 2.09× higher throughput, a 3.67% accuracy increase, 2.45× higher power efficiency, and a 1.18× power reduction. The results show that our proposed accelerator exhibits significant advantages over state-of-the-art solutions.
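To make the core compression technique concrete, the following is a minimal illustrative sketch (not the paper's implementation) of tensor-train decomposition: a weight tensor is factorized into a chain of small 3-way cores via sequential truncated SVDs, so a fully-connected layer's parameters shrink from the product of all mode sizes to a sum of small core sizes. The function names `tt_decompose` and `tt_reconstruct` are hypothetical helpers, not APIs from the paper.

```python
# Illustrative TT decomposition via sequential truncated SVDs (assumption:
# this mirrors the standard TT-SVD algorithm, not the paper's exact method).
import numpy as np

def tt_decompose(tensor, max_rank):
    """Split a d-way tensor into TT cores G_k of shape (r_{k-1}, n_k, r_k)."""
    dims = tensor.shape
    d = len(dims)
    cores = []
    r_prev = 1
    mat = tensor.reshape(r_prev * dims[0], -1)
    for k in range(d - 1):
        # Truncated SVD of the current unfolding; r caps the TT rank.
        U, S, Vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(S))
        cores.append(U[:, :r].reshape(r_prev, dims[k], r))
        # Carry the remainder forward and fold in the next mode.
        mat = (np.diag(S[:r]) @ Vt[:r]).reshape(r * dims[k + 1], -1)
        r_prev = r
    cores.append(mat.reshape(r_prev, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the TT cores back into the full tensor."""
    full = cores[0]
    for G in cores[1:]:
        # full: (..., r), G: (r, n, r2) -> (..., n, r2)
        full = np.tensordot(full, G, axes=([-1], [0]))
    # Drop the boundary ranks r_0 = r_d = 1.
    return full.reshape(full.shape[1:-1])
```

With a rank cap large enough to avoid truncation, reconstruction is exact; smaller caps trade accuracy for fewer parameters, which is the lever the hardware design exploits.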