- Keywords: Compact Neural Network Design, Tiny FPGA
- Abstract: Model size and computational complexity of deep convolutional neural networks (DCNNs) are two major factors governing their throughput and energy efficiency when deployed to hardware for inference. Recent work on compact DCNNs and on pruning methods is effective, but both approaches have drawbacks. For instance, more than half the size of all MobileNet models lies in their last two layers, mainly because compact separable convolution (CONV) layers are not applicable to their last fully-connected (FC) layers. In pruning methods, compression is gained at the expense of irregularity in the DCNN architecture, which necessitates additional indexing memory to address non-zero weights, thereby increasing memory footprint, decompression delays, and energy consumption. In this paper, we propose cyclic sparsely connected (CSC) architectures with a memory/computation complexity of O(N log N), where N is the number of nodes/channels in a given DCNN layer. Unlike compact depthwise separable layers, CSC architectures can be used as an overlay for both FC and CONV layers, whose complexity is O(N^2). Unlike pruning methods, CSC architectures are structurally sparse and require no indexing due to their cyclic nature. We show that standard convolution and depthwise convolution layers are both special cases of CSC layers, that their mathematical function, along with that of FC layers, can be unified into a single formulation, and that their implementation can be carried out with a single arithmetic logic component. We examine the efficacy of the CSC architectures for compression of LeNet and MobileNet models with precision ranging from 2 to 32 bits. Lastly, we design configurable application-specific hardware that implements all types of DCNN layers, including FC, CONV, depthwise, CSC-FC, and CSC-CONV, uniformly within a single pipeline and with negligible performance stalls. We configure the hardware with 16 processing engines (PEs) and 12 multiply-accumulate (MAC) units per PE for the deployment of the compressed 8-bit CSC-MobileNet-192. Compared to the state of the art, our implementation for ImageNet classification is 1.5X more energy efficient on FPGA.
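
To make the O(N log N) versus O(N^2) comparison concrete, here is a minimal sketch of one possible cyclic sparse connectivity pattern. The abstract does not specify the exact wiring, so everything below is an assumption for illustration: the function name `csc_masks`, the `fanout` parameter, and the rule that each node in a layer connects to `fanout` nodes at a stride that grows by a factor of `fanout` per layer, wrapping around modulo N. It is not presented as the paper's implementation.

```python
import numpy as np


def csc_masks(n, fanout):
    """Connectivity masks for a stack of cyclic sparse layers (illustrative).

    Assumption: n is an exact power of fanout. Layer l connects node i to
    nodes (i + k * fanout**l) mod n for k = 0 .. fanout - 1, so no index
    table is needed; the pattern is fully determined by n and fanout.
    """
    masks = []
    stride = 1
    while stride < n:
        mask = np.zeros((n, n), dtype=bool)
        for i in range(n):
            for k in range(fanout):
                mask[i, (i + k * stride) % n] = True
        masks.append(mask)
        stride *= fanout
    return masks


n, fanout = 16, 2
masks = csc_masks(n, fanout)

# Propagate reachability through the stack: entry (i, j) is nonzero when
# input node i can influence output node j after all sparse layers.
reach = np.eye(n, dtype=int)
for m in masks:
    reach = (reach @ m.astype(int) > 0).astype(int)

fully_connected = bool(reach.all())
sparse_weights = sum(int(m.sum()) for m in masks)  # n * fanout * log_fanout(n)
dense_weights = n * n                              # one dense N x N layer

print(len(masks), fully_connected, sparse_weights, dense_weights)
```

Under these assumptions, a stack of log_F(N) sparse layers with fan-out F still lets every input reach every output, while storing N * F * log_F(N) weights instead of N^2. The gap widens as N grows, which is the regime the abstract targets for FC layers.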