MetaDist: An Infrastructure for Automatic Parallelism via ShardCombine Algorithm

Shenggan Cheng; Lansong Diao; Zongyan Cao; Siyu Wang; Wei Lin; Yang You

22 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.

Supplementary Material: zip

Primary Area: infrastructure, software libraries, hardware, etc.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: Automatic Parallelism, Distributed Training, Machine Learning Framework, Single Program Multiple Data

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

Abstract: As models grow larger and hardware limitations become more pronounced, parallel training techniques have become increasingly important for improving training efficiency. However, the choice and combination of these techniques strongly affect their effectiveness. Automatic parallelism methods have emerged to select the best combination from a search space of parallel strategies. These methods, however, rely heavily on manually annotated SPMD sharding rules for each operator, which makes them difficult to develop, maintain, and benchmark, and limits their compatibility across framework ecosystems. In this work, we present MetaDist, an infrastructure for automatic parallelism. We propose two abstract data structures, MetaOp and MetaIR, which enable us to construct the MetaSPMD space. The ShardCombine Algorithm obviates the need for manual annotation, significantly reducing development and maintenance costs. Moreover, our approach is natively compatible with multiple ecosystems, including PyTorch and JAX. To validate our design, we implement two baseline automatic parallelism algorithms on top of MetaDist. Our experiments demonstrate that our approach achieves state-of-the-art performance compared with other distributed solutions.
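
The abstract does not spell out how the ShardCombine Algorithm removes the need for manual annotation, so the following is only a toy sketch of one way sharding rules could be discovered by execution rather than written by hand: shard an operator's inputs, run the operator on each shard, and check whether the per-shard outputs can be combined back into the unsharded result. Every name here (`discover_rules`, `split`, `concat`, `add_all`, `COMBINERS`) is hypothetical and is not MetaDist's API; the sketch assumes 2-D inputs, two devices, and plain Python lists in place of PyTorch or JAX tensors.

```python
from itertools import product


def matmul(a, b):
    """Plain-Python matrix product on lists of rows; rejects mismatched shapes."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions differ")
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split(m, dim, parts):
    """Cut matrix m into `parts` equal chunks along dim (0 = rows, 1 = columns)."""
    if dim == 0:
        step = len(m) // parts
        return [m[i * step:(i + 1) * step] for i in range(parts)]
    step = len(m[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in m] for i in range(parts)]


def concat(chunks, dim):
    """Glue chunks back together along dim."""
    if dim == 0:
        return [row for chunk in chunks for row in chunk]
    return [sum((chunk[r] for chunk in chunks), []) for r in range(len(chunks[0]))]


def add_all(chunks):
    """Element-wise sum of equally shaped chunks (a stand-in for all-reduce)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*chunks)]


# Hypothetical candidate combine functions, tried in this order.
COMBINERS = {
    "concat dim 0": lambda chunks: concat(chunks, 0),
    "concat dim 1": lambda chunks: concat(chunks, 1),
    "sum": add_all,
}


def discover_rules(op, args, parts=2):
    """Map each valid input sharding (one dim or None per argument) to a combiner."""
    expected = op(*args)
    rules = {}
    for dims in product((None, 0, 1), repeat=len(args)):
        if all(d is None for d in dims):
            continue  # fully replicated: nothing is sharded
        # None means the argument is replicated on every device.
        shards = [split(a, d, parts) if d is not None else [a] * parts
                  for a, d in zip(args, dims)]
        try:
            outputs = [op(*(s[i] for s in shards)) for i in range(parts)]
        except ValueError:
            continue  # the operator cannot run on these shard shapes
        for name, combine in COMBINERS.items():
            if combine(outputs) == expected:
                rules[dims] = name
                break
    return rules


A = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
B = [[(r + 2) * (c + 1) + r * r for c in range(4)] for r in range(4)]
for dims, name in discover_rules(matmul, [A, B]).items():
    print(dims, "->", name)
```

For this 4x4 matrix product the sketch recovers the three familiar rules: sharding the rows of the first operand (concatenate along rows), sharding the columns of the second operand (concatenate along columns), and sharding the contracted dimension of both (sum the partial results). A real system would also need to handle higher-rank tensors, floating-point tolerance, and operators with multiple outputs, none of which is attempted here.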

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 4633
