FAB: FPGA-Accelerated Fully-Pipelined Bottleneck Architecture With Batching for High-Performance MobileNetv2 Inference
Abstract: Lightweight neural networks (LWNNs) primarily employ the bottleneck block (BB) introduced in MobileNetv2 or similar architectural structures. However, the channel expansion-reduction process in the BB imposes substantial activation memory overhead, a challenge that prior studies on BB-based LWNN accelerators have not adequately addressed. To overcome this limitation, we propose a fully-pipelined bottleneck architecture (FPB) optimized for efficient hardware deployment of the BB. FPB eliminates intermediate off-chip memory accesses, resolving the deployment challenges associated with the BB and enabling an end-to-end accelerator architecture. To enhance hardware efficiency, each FPB core employs a 2-LUT DSP, Fused-ReLU6, and Q-Residual, optimizing computational performance while minimizing resource consumption. Furthermore, we introduce a batching technique that maximizes the benefits of FPB by sustaining high utilization across FPB cores while processing multiple images concurrently. To hide the off-chip memory access latency that batching inherently incurs, we propose a stem layer latency hiding technique that prevents performance degradation. Evaluated on the VCU118 board, our MobileNetv2 accelerator achieves an energy efficiency of 120.7 GOPS/W at a batch size of 4, a $1.5\times$ to $10.5\times$ improvement over prior work. Depending on the batch size configuration, the FAB accelerator delivers a throughput of 204.2 GOPS to 772.7 GOPS, demonstrating its high computational efficiency.
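To make the activation memory overhead concrete, the following minimal Python sketch (not from the paper) tallies the per-stage activation footprint of one MobileNetv2 inverted-residual bottleneck block. Shapes and the expansion factor $t=6$ follow the MobileNetv2 paper; the 8-bit activation width and the example stage dimensions are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's code): activation footprint
# of a MobileNetv2 inverted-residual bottleneck block, showing why the
# channel expansion-reduction step dominates intermediate memory.
# Assumptions: expansion factor t=6 (MobileNetv2 default), 8-bit activations.

def bottleneck_activation_bytes(h, w, c_in, c_out, t=6, stride=1, bytes_per_act=1):
    """Per-stage activation sizes (bytes) inside one bottleneck block."""
    expanded = h * w * (c_in * t) * bytes_per_act      # after 1x1 expansion conv
    h2, w2 = h // stride, w // stride
    depthwise = h2 * w2 * (c_in * t) * bytes_per_act   # after 3x3 depthwise conv
    projected = h2 * w2 * c_out * bytes_per_act        # after 1x1 projection conv
    return expanded, depthwise, projected

# Example: an early MobileNetv2 stage (56x56 feature map, 24 -> 24 channels).
exp, dw, proj = bottleneck_activation_bytes(56, 56, 24, 24)
print(f"expanded : {exp / 1024:7.1f} KiB")   # ~441 KiB, 6x the block's output
print(f"depthwise: {dw / 1024:7.1f} KiB")    # ~441 KiB
print(f"projected: {proj / 1024:7.1f} KiB")  # ~73.5 KiB
```

The expanded and depthwise activations are $t\times$ larger than the block's input and output, which is why an accelerator that spills them to off-chip memory pays a heavy bandwidth cost; FPB's fully-pipelined design avoids these intermediate off-chip accesses entirely.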
External IDs: dblp:journals/tcasI/KimKK25