Abstract: DNN models are becoming increasingly large to achieve unprecedented accuracy, and the accompanying growth in computation and memory requirements necessitates massive clusters and elaborate parallelization strategies to accelerate DNN training. To optimize performance and analyze cost, it is indispensable to model the training throughput of distributed DNN training. However, complex parallelization strategies and the resulting complex runtime behaviors make it challenging to construct an accurate performance model. In this article, we present Proteus, the first standalone simulator to model the performance of complex parallelization strategies through simulated execution. Proteus first models complex parallelization strategies with a unified representation named Strategy Tree. It then compiles the strategy tree into a distributed execution graph and simulates the complex runtime behaviors, namely computation-communication (comp-comm) overlap and bandwidth sharing, with a Hierarchical Topo-Aware Executor (HTAE). We finally evaluate Proteus across a wide variety of DNNs on three hardware configurations. Experimental results show that Proteus achieves a 3.0% average prediction error and preserves the throughput ordering of various parallelization strategies. Compared to state-of-the-art approaches, Proteus reduces prediction error by up to 133.8%.