Abstract: Tensor-train (TT) decomposition enables ultra-high compression ratios, making deep neural network (DNN) accelerators based on this method very attractive. TIE, the state-of-the-art TT-based DNN accelerator, achieves high performance by leveraging a compact inference scheme to remove unnecessary computation and memory accesses. However, TIE incurs extra memory costs for stage-wise intermediate results and additional intra-layer data transfer, leading to limited speedups even when the models are highly compressed. To unleash the full potential of TT decomposition, this paper proposes ETTE, an algorithm and hardware co-optimization framework for an Efficient Tensor-Train Engine. At the algorithm level, ETTE proposes a new tensor core construction and computation ordering mechanism that reduces stage-wise computation and storage costs at the same time. At the hardware level, ETTE proposes a lookahead-style across-stage processing scheme to eliminate unnecessary stage-wise data movement. By fully leveraging the decoupled input and output dimension factors, ETTE develops a low-cost, partition-free memory access scheme that efficiently supports the desired matrix transformations. We demonstrate the effectiveness of ETTE by implementing a 16-PE hardware prototype in 28nm CMOS technology. Compared with a GPU on various workloads, ETTE achieves 6.5×–253.1× higher throughput and 189.2×–9750.5× higher energy efficiency. Compared with state-of-the-art DNN accelerators, ETTE brings 1.1×–58.3×, 2.6×–1170.4×, and 1.8×–2098.2× improvements in throughput, energy efficiency, and area efficiency, respectively.
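To make the idea concrete, below is a minimal NumPy sketch of a TT-compressed fully-connected layer, the general technique that TIE- and ETTE-style accelerators target: an M×N weight matrix (with M = m1·m2·m3 and N = n1·n2·n3) is stored as small 4-D TT cores, and the layer is evaluated stage by stage against the cores instead of the dense matrix. This is not the paper's algorithm or hardware mapping; the mode sizes, ranks, and function names are illustrative assumptions.

```python
# Minimal sketch of a TT-compressed fully-connected layer (illustrative only;
# not the TIE/ETTE implementation). Mode sizes, ranks, and names are assumptions.
import numpy as np

def random_tt_cores(in_modes, out_modes, rank, rng):
    """Random TT cores; core k has shape (r_{k-1}, m_k, n_k, r_k)."""
    d = len(in_modes)
    ranks = [1] + [rank] * (d - 1) + [1]
    return [rng.standard_normal((ranks[k], in_modes[k], out_modes[k], ranks[k + 1]))
            / np.sqrt(in_modes[k] * ranks[k])
            for k in range(d)]

def tt_to_dense(cores, in_modes, out_modes):
    """Contract the TT cores back into the dense M x N weight (for checking only)."""
    d = len(cores)
    full = cores[0]
    for G in cores[1:]:
        full = np.tensordot(full, G, axes=([-1], [0]))
    # full: (1, m1, n1, m2, n2, ..., md, nd, 1) -> (m1..md, n1..nd) -> (M, N)
    full = full.reshape([s for k in range(d) for s in (in_modes[k], out_modes[k])])
    full = full.transpose([2 * k for k in range(d)] + [2 * k + 1 for k in range(d)])
    return full.reshape(int(np.prod(in_modes)), int(np.prod(out_modes)))

def tt_layer_forward(x, cores, in_modes):
    """Stage-wise y = x @ W using only the TT cores, never forming W explicitly."""
    z = x.reshape(x.shape[0], *in_modes)[..., None]   # append rank index r_0 = 1
    for G in cores:
        # Contract the current input mode (axis 1) and the running rank (last axis)
        # with one core; the core's output mode and next rank are appended.
        z = np.tensordot(z, G, axes=([1, z.ndim - 1], [1, 0]))
    return z.reshape(x.shape[0], -1)                  # (batch, n1*n2*...*nd)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    in_modes, out_modes, rank = (4, 8, 8), (4, 8, 8), 4
    cores = random_tt_cores(in_modes, out_modes, rank, rng)
    W = tt_to_dense(cores, in_modes, out_modes)        # 256 x 256 dense reference

    x = rng.standard_normal((2, int(np.prod(in_modes))))
    y_tt, y_dense = tt_layer_forward(x, cores, in_modes), x @ W
    print("max |y_tt - y_dense| =", np.abs(y_tt - y_dense).max())  # ~1e-13
    print("TT params:", sum(G.size for G in cores), "vs dense:", W.size)
```

The stage-wise loop in `tt_layer_forward` is where the intermediate results discussed in the abstract arise: each stage produces a partial tensor that the next stage consumes, which is exactly the data movement that the proposed across-stage processing scheme aims to reduce.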