Reuse Kernels or Activations? A Flexible Dataflow for Low-latency Spectral CNN Acceleration

FPGA 2020
Abstract: Spectral-domain CNNs have been shown to be more efficient than traditional spatial CNNs in reducing computational complexity. However, they come with a 'kernel explosion' problem that, even after compression (pruning), imposes a high memory burden and off-chip bandwidth requirement for kernel access. This creates a gap between the acceleration that compression can potentially offer and the performance actually achieved on FPGAs, especially for low-latency CNN inference. In this paper, we develop a principled approach to closing this gap and designing a low-latency, low-bandwidth, spectral sparse CNN accelerator on FPGAs. First, we analyze the bandwidth-storage tradeoff of sparse convolutional layers and locate the communication bottlenecks. We then develop a dataflow that flexibly optimizes data reuse in different layers to minimize off-chip communication. Finally, we propose a novel scheduling algorithm to optimally schedule the on-chip memory accesses of multiple sparse kernels and minimize read conflicts. On a state-of-the-art FPGA platform, our design reduces data transfers by 42%, achieves DSP utilization of up to 90%, and delivers an inference latency of 9 ms for VGG16, compared to the state-of-the-art baseline latency of 68 ms.
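
The title question, reuse kernels or reuse activations, comes down to a per-layer choice between keeping kernels on chip while streaming activations and keeping activations on chip while streaming kernels. The sketch below is a minimal illustration of that decision as an off-chip traffic comparison; it is not the authors' cost model or dataflow, and the Layer fields, byte counts, and pass counts are hypothetical placeholders.

    # Minimal sketch (assumed cost model, not the paper's): for each layer,
    # estimate off-chip traffic under "reuse kernels" (activations streamed)
    # versus "reuse activations" (kernels streamed), and pick the cheaper one.
    from dataclasses import dataclass

    @dataclass
    class Layer:
        name: str
        act_bytes: int      # spectral activation footprint (input + output tiles)
        kernel_bytes: int   # compressed (pruned) spectral kernel footprint
        act_passes: int     # times activations are re-read if not cached on chip
        kernel_passes: int  # times kernels are re-read if not cached on chip

    def offchip_traffic(layer: Layer, reuse_kernels: bool) -> int:
        """Bytes moved off-chip for one inference of this layer."""
        if reuse_kernels:
            # Kernels stay on chip; activations are streamed on every pass.
            return layer.kernel_bytes + layer.act_bytes * layer.act_passes
        # Activations stay on chip; kernels are streamed on every pass.
        return layer.act_bytes + layer.kernel_bytes * layer.kernel_passes

    def choose_dataflow(layers):
        plan = {}
        for layer in layers:
            k = offchip_traffic(layer, reuse_kernels=True)
            a = offchip_traffic(layer, reuse_kernels=False)
            plan[layer.name] = ("reuse_kernels", k) if k < a else ("reuse_activations", a)
        return plan

    if __name__ == "__main__":
        # Illustrative only: early layers have large activations and small kernels,
        # late layers the opposite, so the preferred reuse flips across the network.
        layers = [
            Layer("conv1", act_bytes=6_400_000, kernel_bytes=40_000,
                  act_passes=4, kernel_passes=4),
            Layer("conv5", act_bytes=200_000, kernel_bytes=2_400_000,
                  act_passes=4, kernel_passes=4),
        ]
        for name, (choice, bytes_moved) in choose_dataflow(layers).items():
            print(f"{name}: {choice} ({bytes_moved} bytes off-chip)")

Under these assumed numbers, the early layer keeps its small kernels on chip while the late layer keeps its small activations on chip, which is the kind of layer-by-layer flexibility the abstract's dataflow refers to.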