Abstract: Partitioning a CNN and executing inference in parallel across multiple IoT devices has gained popularity as a way to meet real-time requirements without sacrificing model accuracy. However, existing algorithms struggle to find the optimal model partitioning granularity for complex CNNs. Additionally, scheduling inference across heterogeneous IoT devices is NP-hard when the structure of the CNN is a directed acyclic graph (DAG) rather than a chain. In this paper, we introduce DeepZoning, a versatile and cooperative inference framework that combines both model and data parallelism to accelerate CNN inference. DeepZoning employs two algorithms at different levels: (1) a low-level Adaptive Workload Partition algorithm that uses linear programming and optimizes over both the spatial and channel dimensions when searching for a feature-map distribution across heterogeneous devices, and (2) a high-level Model Partition algorithm that finds the optimal model granularity and organizes complex CNNs into sequential zones to balance communication and computation during execution.
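The abstract only names the linear-programming formulation behind the Adaptive Workload Partition algorithm; the sketch below is a minimal illustration (not the paper's implementation) of how such an LP can split a feature map's rows across heterogeneous devices. The device speeds `speed`, per-device transfer overheads `comm`, and the row-wise (spatial-only) split are assumptions for illustration.

```python
# Minimal sketch: LP-based spatial workload partition across heterogeneous
# devices. Each device i is assumed to process rows at `speed[i]` rows/ms
# and to incur a fixed transfer overhead `comm[i]` ms. The feature map's
# rows are split so the slowest device's latency (makespan T) is minimized.
import numpy as np
from scipy.optimize import linprog

def partition_rows(total_rows, speed, comm):
    n = len(speed)
    # Decision variables: x_0..x_{n-1} (rows assigned to device i), plus T.
    c = np.zeros(n + 1)
    c[-1] = 1.0                       # objective: minimize makespan T
    # Per-device latency constraint: x_i / speed_i + comm_i <= T
    A_ub = np.zeros((n, n + 1))
    b_ub = np.zeros(n)
    for i in range(n):
        A_ub[i, i] = 1.0 / speed[i]
        A_ub[i, -1] = -1.0
        b_ub[i] = -comm[i]
    # All rows must be assigned: sum_i x_i = total_rows
    A_eq = np.ones((1, n + 1))
    A_eq[0, -1] = 0.0
    b_eq = [total_rows]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (n + 1))
    # The LP yields fractional splits; a real system would round to whole rows.
    return res.x[:n], res.x[-1]

rows, makespan = partition_rows(224, speed=[4.0, 2.0, 1.0], comm=[1.0, 2.0, 3.0])
print(rows, makespan)
```

The same min-max structure extends to splits along the channel dimension by replacing rows with channel groups in the decision variables.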