ShortcutFusion: From Tensorflow to FPGA-Based Accelerator With a Reuse-Aware Memory Allocation for Shortcut Data

Abstract: Residual block is a very common component in recent state-of-the art CNNs such as EfficientNet/EfficientDet. Shortcut data accounts for nearly 40% of feature-maps access in ResNet152. Most of the previous DNN compilers/accelerators ignore the shortcut data optimization. This paper presents <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">ShortcutFusion</i> , an optimization tool for FPGA-based accelerator with a reuse-aware static memory allocation for shortcut data, to maximize on-chip data reuse given resource constraints. From TensorFlow DNN models, the proposed design generates instruction sets for a group of nodes which uses an optimized data reuse for each residual block. The accelerator design implemented on the Xilinx KCU1500 FPGA card <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$2.8\times $ </tex-math></inline-formula> faster and <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$9.9\times $ </tex-math></inline-formula> more power efficient than NVIDIA RTX 2080 Ti for <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$256\times 256$ </tex-math></inline-formula> input size. Compared to the result from baseline, in which the weights/inputs/outputs are accessed from the off-chip memory exactly once per each layer, ShortcutFusion reduces the DRAM access by 47.8-84.8% for RetinaNet, Yolov3, ResNet152, and EfficientNet. Given a similar buffer size to ShortcutMining, which also “mine” the shortcut data in hardware, the proposed work reduces off-chip access for feature-maps <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$5.27\times $ </tex-math></inline-formula> while accessing weight from off-chip memory exactly once.
0 Replies
Loading