Accelerating convolutional neural network with FFT on tiny cores

2017 (modified: 04 Nov 2022) · ISCAS 2017 · Readers: Everyone
Abstract: Fueled by the ILSVRC and COCO competitions, the Convolutional Neural Network (CNN) has become important in computer vision and natural language processing. However, state-of-the-art CNNs are computationally and memory intensive, so energy-efficient implementation on embedded platforms is challenging. Recently, VGGNet and ResNet showed that deep neural networks with more convolutional layers (CV) and fewer fully connected layers (FC) can achieve lower error rates; reducing the complexity of convolutional layers is therefore of utmost importance. To reduce computation and shared-memory usage in convolutional layers, in this paper we evaluate the performance of direct convolution (Direct-Conv), Fast Fourier Transform based convolution (FFT-Conv), and overlap-and-add FFT convolution (FFT-OVA-Conv) on embedded architectures, including a low-power domain-specific many-core architecture called Power Efficient Nano Clusters (PENC) and the ARM Cortex-A53 CPU. To demonstrate the efficiency of FFT-Conv and FFT-OVA-Conv, we map ResNet-20 for the CIFAR-10 dataset onto PENC as well as the ARM Cortex-A53 CPU. Results are evaluated and compared with respect to throughput per watt, energy-delay product, and execution time for the three methods. Using the built-in FFT instruction on PENC, FFT-OVA-Conv runs 2.9× and 1.65× faster and achieves 6.7× and 2.3× better throughput per watt than Direct-Conv and FFT-Conv, respectively. On the ARM A53 CPU, FFT-OVA-Conv achieves a 3.36× and 1.38× improvement in execution time and 2.72× and 1.32× better throughput than Direct-Conv and FFT-Conv, respectively.
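The three convolution schemes the abstract compares can be illustrated in 1-D with NumPy. This is a minimal sketch of the general techniques (direct convolution, FFT convolution via the convolution theorem, and overlap-and-add FFT convolution), not the paper's fixed-point PENC or ARM implementation; the function names and the block size are illustrative choices.

```python
import numpy as np

def direct_conv(x, h):
    """Direct (time-domain) convolution: O(N*K) multiply-adds."""
    return np.convolve(x, h)

def fft_conv(x, h):
    """FFT-based convolution: transform both signals (zero-padded to the
    full output length), multiply pointwise, and inverse-transform."""
    n = len(x) + len(h) - 1
    return np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)

def fft_ova_conv(x, h, block=64):
    """Overlap-and-add FFT convolution: split x into short blocks,
    FFT-convolve each block with the kernel using a small transform,
    and add the overlapping tails back together."""
    k = len(h)
    n = block + k - 1                 # per-block FFT length
    H = np.fft.rfft(h, n)             # kernel spectrum, computed once
    y = np.zeros(len(x) + k - 1)
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        y_seg = np.fft.irfft(np.fft.rfft(seg, n) * H, n)
        y[start:start + len(seg) + k - 1] += y_seg[:len(seg) + k - 1]
    return y
```

For short CNN kernels, overlap-and-add keeps each FFT small (proportional to the block size plus the kernel length, rather than the whole feature-map width), which is why it maps well to small per-core memories like PENC's.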
