TAP: Efficient Derivation of Tensor Parallel Plans for Large Neural Networks

Published: 16 May 2023 | Last Modified: 15 Jun 2023 | ASSYST Oral | Readers: Everyone
Keywords: distributed learning, machine learning system, model parallelism
TL;DR: We present a framework that drastically speeds up the process of deriving the tensor parallel schedule for large neural networks.
Abstract: Model parallelism is essential to train large language models efficiently. However, determining the optimal model parallel schedule for a given neural network can be slow and inefficient due to the vast search space. To address this challenge, we propose a tensor model parallelism framework called TAP, which automatically searches for the best data and tensor parallel schedules. Our approach is based on the observation that a neural network can be represented as a directed acyclic graph that contains only a limited set of frequent subgraphs. Building on this observation, we design a graph pruning algorithm that efficiently folds the search space. As a result, TAP runs at sub-linear complexity with respect to model size, which makes it a practical solution for large-scale networks. Experimental results demonstrate that TAP outperforms state-of-the-art automatic parallelism frameworks by $20-160\times$ in search time. Moreover, the performance of TAP's discovered schedules is competitive with expert-engineered ones. In summary, TAP provides a powerful and efficient tool for model parallelism that can help alleviate the burden of manual tuning.
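
The sketch below illustrates the folding idea described in the abstract, not TAP's actual implementation: identical layers in the model DAG (e.g., the repeated transformer blocks of a large language model) share one structural signature, so the expensive per-subgraph schedule search runs only once per unique subgraph and the result is reused for every repetition. Names such as `Subgraph`, `signature`, and `search_schedule` are illustrative assumptions, not TAP's API.

```python
# Minimal sketch of folding a model DAG by frequent subgraphs (assumed names,
# not TAP's implementation): search a parallel schedule once per unique
# subgraph signature, then reuse it for all repetitions.
from collections import defaultdict
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Subgraph:
    """A candidate unit of the model DAG (e.g., one transformer block)."""
    name: str
    op_types: tuple  # ordered operator types, used as a structural signature


def signature(sg: Subgraph) -> tuple:
    # Subgraphs with identical operator structure receive the same signature.
    return sg.op_types


def search_schedule(sg: Subgraph, num_devices: int) -> dict:
    # Placeholder for the expensive per-subgraph search: enumerate a sharding
    # choice (row- vs. column-parallel) per operator and return the first
    # feasible assignment. A real search would cost-model communication.
    choices = product(("row", "column"), repeat=len(sg.op_types))
    best = next(choices)
    return {op: shard for op, shard in zip(sg.op_types, best)}


def fold_and_plan(model: list, num_devices: int) -> dict:
    """Group subgraphs by signature, search once per group, reuse the result."""
    plan = {}
    groups = defaultdict(list)
    for sg in model:
        groups[signature(sg)].append(sg)
    for _, members in groups.items():
        schedule = search_schedule(members[0], num_devices)  # search once
        for sg in members:                                   # reuse everywhere
            plan[sg.name] = schedule
    return plan


if __name__ == "__main__":
    # 48 identical transformer blocks collapse into a single search call,
    # which is the source of the sub-linear scaling with model size.
    blocks = [Subgraph(f"block_{i}", ("attention", "mlp")) for i in range(48)]
    print(len(fold_and_plan(blocks, num_devices=8)))  # 48 plan entries, 1 search
```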
Workshop Track: ASSYST
Presentation: In-Person
Presenter Full Name: Ziji Shi
Presenter Email: zijishi@comp.nus.edu.sg
Presenter Bio: Ziji Shi is a third-year Ph.D. student from National University of Singapore. His research interests lie in distributed machine learning systems and high-performance computing.