Abstract: With the ever-increasing size and complexity of deep neural network models, it has become difficult to fit and train a complete copy of a model on a single computational device with limited capacity. Large neural networks are therefore usually trained on a mixture of devices, including multiple CPUs and GPUs, and training speed and efficiency are drastically affected by how these models are partitioned and placed across the devices. In this paper, we propose Mars, a novel design for finding efficient placements of large models. Mars leverages a self-supervised graph neural network pre-training framework to generate node representations for operations, capturing the topological properties of the computational graph. A sequence-to-sequence neural network is then applied to split large models into small segments so that Mars can predict their placements sequentially. Novel optimizations are applied in the placer design to minimize the time needed to train the agent for placing very large models. We deployed and evaluated Mars on benchmarks involving the Inception-V3, GNMT, and BERT models. Extensive experimental results show that Mars achieves up to 27.2% and 2.7% speedup in per-step training time over the state-of-the-art for the GNMT and BERT models, respectively. We also show that, with self-supervised graph neural network pre-training, our design is the fastest at discovering the optimal placement for Inception-V3.
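To make the two-stage pipeline described in the abstract concrete, the sketch below illustrates the general shape of such a placer: a graph encoder produces an embedding for each operation of a computational graph, and a sequence-to-sequence model consumes the embeddings and emits one device id per operation. This is a minimal illustration, not the authors' implementation; the class names (GraphEncoder, Seq2SeqPlacer), feature dimensions, and the dense adjacency placeholder are all assumptions for the example.

```python
# Minimal sketch of a graph-encoder + seq2seq placer (illustrative only;
# not the Mars implementation). All module names and sizes are hypothetical.
import torch
import torch.nn as nn


class GraphEncoder(nn.Module):
    """Toy message-passing encoder: mixes each operation's features with its
    neighbours' features via the adjacency matrix of the computational graph."""

    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hid_dim)
        self.lin2 = nn.Linear(hid_dim, hid_dim)

    def forward(self, x, adj):
        # x:   (num_ops, in_dim) per-operation features (e.g. op type, output shape)
        # adj: (num_ops, num_ops) adjacency of the computational graph
        h = torch.relu(self.lin1(adj @ x))
        return torch.relu(self.lin2(adj @ h))


class Seq2SeqPlacer(nn.Module):
    """Toy seq2seq placer: an LSTM encoder reads the node embeddings in
    topological order; an LSTM decoder predicts a device for each operation."""

    def __init__(self, hid_dim, num_devices):
        super().__init__()
        self.encoder = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, num_devices)

    def forward(self, node_emb):
        seq = node_emb.unsqueeze(0)              # add a batch dimension
        enc_out, state = self.encoder(seq)
        dec_out, _ = self.decoder(enc_out, state)
        logits = self.out(dec_out)               # (1, num_ops, num_devices)
        return logits.argmax(dim=-1).squeeze(0)  # one device id per operation


if __name__ == "__main__":
    num_ops, in_dim, hid_dim, num_devices = 6, 8, 32, 4
    x = torch.randn(num_ops, in_dim)
    adj = torch.eye(num_ops)  # placeholder adjacency; a real graph would be sparse
    encoder = GraphEncoder(in_dim, hid_dim)
    placer = Seq2SeqPlacer(hid_dim, num_devices)
    placement = placer(encoder(x, adj))
    print(placement)  # e.g. tensor([2, 0, 3, 1, 1, 2]) with untrained weights
```

In the paper's setting, the graph encoder would be pre-trained with a self-supervised objective on the computational graph and the placer trained as an agent to minimize per-step training time; here both networks are left untrained purely to show the data flow.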