Keywords: Real-time Object Detection, Convolutional Neural Network, Efficient Architecture, Mobile Device, Embedded Vision
Abstract: An increasing need of running Convolutional Neural Network (CNN) models on mobile devices with limited computing power and memory resource encourages studies on efficient model design. A number of efficient architectures have been proposed in recent years, for example, MobileNet, ShuffleNet, and NASNet-A. However, all these models are heavily dependent on depthwise separable convolution which lacks efficient implementation in most deep learning frameworks. In this study, we propose an efficient architecture named PeleeNet, which is built with conventional convolution instead. On ImageNet ILSVRC 2012 dataset, our proposed PeleeNet achieves a higher accuracy by 0.6% (71.3% vs. 70.7%) and 11% lower computational cost than MobileNet, the state-of-the-art efficient architecture. Meanwhile, PeleeNet is only half of the model size of MobileNet. We then propose a real-time object detection system by combining PeleeNet with Single Shot MultiBox Detector (SSD) method and optimizing the architecture for fast speed. Our proposed detection system, named Pelee, achieves 70.9% mAP (mean average precision) on PASCAL VOC2007 dataset at the speed of 17.1 FPS on iPhone 6s and 23.6 FPS on iPhone 8. Compared to TinyYOLOv2, our proposed Pelee is more accurate (70.9% vs. 57.1%), 1.88 times lower in computational cost and 1.92 times smaller in model size. The code and models are open sourced.