ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale
Abstract: As deep learning models and input data continue to scale at an unprecedented rate, it has become inevitable to move toward distributed training platforms to fit the models and increase training throughput. State-of-the-art distributed training systems are adopting emerging approaches and techniques such as wafer-scale nodes, multi-dimensional network topologies, disaggregated memory systems, and optimized parallelization strategies. This results in a complex software/hardware co-design stack, necessitating a modeling/simulation infrastructure for design-space exploration. This paper introduces ASTRA-sim2.0, which extends the open-source ASTRA-sim infrastructure with capabilities to model state-of-the-art and emerging distributed training models and platforms. Specifically, we enable ASTRA-sim to (i) support arbitrary model parallelization strategies via a graph-based training-loop implementation, (ii) implement a parameterizable multi-dimensional heterogeneous topology generation infrastructure with the capability to simulate target systems at scale through analytical performance estimation, and (iii) enhance memory system modeling to support accurate modeling of in-network collective communication and disaggregated memory systems. With these capabilities, we conduct comprehensive case studies targeting emerging distributed models and platforms. ASTRA-sim2.0 enables system designers to swiftly traverse the complex co-design stack and gain meaningful insights when designing and deploying distributed training platforms at scale.