Low-memory and high-performance CNN inference on distributed systems at the edge

Erqian Tang, Todor P. Stefanov

Published: 2021, Last Modified: 06 Aug 2024, UCC Companion 2021, CC BY-SA 4.0

Abstract: Some applications now require CNN inference on resource-constrained edge devices whose memory and computation capacity may be too limited to fit a large CNN model. In such scenarios, deploying a large CNN model and performing inference on a single edge device is not feasible. A possible solution is to deploy the large CNN model on a (fully) distributed system at the edge and use all available edge devices to perform the CNN inference cooperatively. We have observed that existing methodologies, which use different partitioning strategies to deploy a CNN model and perform inference on a distributed system at the edge, have several disadvantages. Therefore, in this paper, we propose a novel partitioning strategy, called Vertical Partitioning Strategy, together with a novel methodology needed to utilize this strategy efficiently for CNN model inference on a distributed system at the edge. We compare our experimental results on the YOLOv2 CNN model with results obtained by three existing methodologies and show the advantages of our methodology in terms of memory requirement per edge device and overall system performance. Moreover, our experimental results on other representative CNN models show that our methodology, utilizing our partitioning strategy, delivers CNN inference with a greatly reduced memory requirement per edge device and improved overall system performance at the same time.