How to Avoid Zero-Spacing in Fractionally-Strided Convolution? A Hardware-Algorithm Co-Design Methodology

Abstract: Fractionally Strided Convolution (FSC) is a key operation in popular image-based Deep Learning models: for example, back-propagation in CNN training and the decoding stage of convolutional auto-encoders and generative CNNs (e.g., GANs). FSC typically performs up-convolution on a 2-D grid image, i.e., it expands the image to a larger one, in contrast to conventional (down-)convolution, and this results in more complex computation patterns. Specifically, it introduces additional interleaved zero-spacing (i.e., insertion and padding of zeros) in feature maps, which imposes excessive computation and memory-access overheads on traditional convolution methods such as im2col. The resulting hardware under-utilization is especially severe in layers with large kernels and large strides, which are commonly seen in typical CNNs and generative CNNs. In this paper, we propose a methodology that addresses this challenge by using a multi-channel, multi-kernel parallel algorithm, kn2row, to eliminate zero-computations in FSC. We further develop a unified accelerator for kn2row-based convolution and FSC operations in High-Level Synthesis (HLS). Benefiting from the compute reduction of kn2row, we achieve up to 14.6x improvement in effective resource utilization in typical convolutional auto-decoding layers, GAN layers, and the backward pass of Nature-CNN, a reinforcement learning benchmarking model. These improvements lead to an overall speedup of up to 3.8x in the complete forward or backward propagation phases of the above benchmarks. Our methodology yields up to 8x speedup and 11x better power efficiency compared with general-purpose processors. Compared with existing GAN accelerators, our methodology achieves higher normalized throughput while remaining highly portable.
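To make the zero-spacing overhead concrete, the following is a minimal single-channel sketch in pure Python, not the accelerator design described in the paper. It compares a zero-insertion reference with a per-kernel-element scatter-add, which is written in the spirit of kn2row: each kernel element contributes a shifted, scaled copy of the input. The function names `fsc_zero_insert` and `fsc_scatter`, the square-kernel restriction, and the multiply-accumulate (MAC) counters are assumptions made for illustration; in the multi-channel case, each kernel element would instead drive a 1x1 convolution (a matrix multiply over channels).

```python
def fsc_zero_insert(x, w, stride):
    """Reference FSC: insert zeros, pad, then run a dense convolution.

    x is an H x W list of lists, w is a K x K list of lists (hypothetical
    single-channel setup). Returns (output, number of MACs performed).
    """
    H, W, K = len(x), len(x[0]), len(w)
    pad = K - 1
    # Zero-spaced grid: stride-1 zeros between neighbours, K-1 zeros of padding.
    Hp = (H - 1) * stride + 1 + 2 * pad
    Wp = (W - 1) * stride + 1 + 2 * pad
    z = [[0] * Wp for _ in range(Hp)]
    for i in range(H):
        for j in range(W):
            z[pad + i * stride][pad + j * stride] = x[i][j]
    Ho, Wo = Hp - K + 1, Wp - K + 1
    out = [[0] * Wo for _ in range(Ho)]
    macs = 0
    for p in range(Ho):
        for q in range(Wo):
            for a in range(K):
                for b in range(K):
                    # Flipped kernel; most of these products involve a zero.
                    out[p][q] += z[p + a][q + b] * w[K - 1 - a][K - 1 - b]
                    macs += 1
    return out, macs


def fsc_scatter(x, w, stride):
    """Zero-free FSC: each kernel element scatters a scaled copy of x."""
    H, W, K = len(x), len(x[0]), len(w)
    Ho, Wo = (H - 1) * stride + K, (W - 1) * stride + K
    out = [[0] * Wo for _ in range(Ho)]
    macs = 0
    for kh in range(K):
        for kw in range(K):
            # One pass per kernel element, touching only real input values.
            for i in range(H):
                for j in range(W):
                    out[i * stride + kh][j * stride + kw] += x[i][j] * w[kh][kw]
                    macs += 1
    return out, macs


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
    ref, ref_macs = fsc_zero_insert(x, w, stride=2)
    fast, fast_macs = fsc_scatter(x, w, stride=2)
    print(ref == fast, ref_macs, fast_macs)
```

In this sketch both functions produce the same output, but the zero-insertion version performs one MAC per output pixel per kernel element, whereas the scatter version performs one MAC per input pixel per kernel element. The gap widens with stride, which is consistent with the abstract's observation that large-stride layers suffer most.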