A Pipelining Strategy for Accelerating Convolutional Networks on ARM Processors

Published: 01 Jan 2019 (PAAP 2019). Last Modified: 20 May 2025. License: CC BY-SA 4.0.
Abstract: Convolutional neural networks (CNNs) play an important role in many fields. These days, many applications run CNN inference with pre-trained models on mobile devices. Improvements in the performance of embedded processors such as ARM-based CPUs make it possible to meet real-time processing requirements. In this paper, a pipelining strategy is proposed to accelerate convolutional networks on ARM processors. We implement a \(3\times 3\) convolution with Neon instructions, the single-instruction, multiple-data (SIMD) instructions supported by ARM processors. To reduce pipeline stalls, the issue order of instructions is rearranged to exploit the out-of-order execution and dual-issue mechanisms of ARM processors. A tiling method is exploited to increase data reuse: the input feature map is divided into multiple \(6\times 6\) tiles, and the computation within each tile is highly optimized using the proposed pipelining strategy. The proposed method achieves a speedup of 2.88 over gcc-compiled code on an RK3288. The effect of the optimization is measured with a performance profiling tool; cycles and cache misses are decreased significantly. A multi-threaded version implemented with OpenMP achieves a speedup of 6.8 over the single-threaded gcc-compiled version.
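The tiling structure described in the abstract can be sketched in scalar C. The tile and kernel sizes follow the abstract (\(6\times 6\) input tiles, \(3\times 3\) kernel, so each tile yields a \(4\times 4\) output patch under valid padding); the function names, the valid-padding layout, and the assumption that the output dimensions divide evenly by the tile size are illustrative choices, not details from the paper, whose actual inner loops are written with Neon SIMD instructions and a hand-scheduled issue order.

```c
#define K        3                   /* 3x3 kernel, as in the paper */
#define TILE_IN  6                   /* 6x6 input tile, as in the paper */
#define TILE_OUT (TILE_IN - K + 1)   /* 4x4 output patch per tile */

/* Naive valid-padding 3x3 convolution over one 6x6 input tile.
 * `in` points at the tile's top-left element (row stride = in_stride).
 * In the paper's implementation these inner loops are replaced by
 * Neon SIMD instructions with a pipelined issue order; this scalar
 * version only illustrates the tiling, not the pipelining. */
static void conv3x3_tile(const float *in, int in_stride,
                         const float *k, float *out, int out_stride) {
    for (int y = 0; y < TILE_OUT; ++y)
        for (int x = 0; x < TILE_OUT; ++x) {
            float acc = 0.0f;
            for (int ky = 0; ky < K; ++ky)
                for (int kx = 0; kx < K; ++kx)
                    acc += in[(y + ky) * in_stride + (x + kx)] * k[ky * K + kx];
            out[y * out_stride + x] = acc;
        }
}

/* Convolve an HxW input (valid padding) by walking it in tiles.
 * Adjacent input tiles overlap by K-1 rows/columns, which is what
 * makes the per-tile data reuse possible. Assumes the output size
 * is a multiple of TILE_OUT, for brevity. */
static void conv3x3_tiled(const float *in, int H, int W,
                          const float *k, float *out) {
    int OH = H - K + 1, OW = W - K + 1;
    for (int ty = 0; ty < OH; ty += TILE_OUT)
        for (int tx = 0; tx < OW; tx += TILE_OUT)
            conv3x3_tile(&in[ty * W + tx], W, k, &out[ty * OW + tx], OW);
}
```

With a \(6\times 6\) all-ones input and an all-ones kernel, every output element is \(9\), since each output sums nine unit products.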