Track: Main paper track (up to 5 pages excluding references and appendix)
Keywords: bayesian optimization, neural network training, parallelized training
Abstract: Training of modern large neural networks (NNs) is often done in parallel across multiple GPUs. While existing parallel training frameworks make it easy to train NNs with multi-dimensional parallelism, the challenge remains in balancing the sizes of the parallelism dimensions and in tuning the hyperparameters within each dimension. Because the number of possible parallelism configurations (PCs) for a given training process is large, exhaustive search over all candidates is infeasible. Existing PC optimization methods either rely on an approximate cost model, which may be inaccurate and hardware-specific, or require a large number of NN training trials on different PCs, each of which is expensive to evaluate. To overcome these issues, we present OPPA, which combines Bayesian optimization with prior knowledge in the form of a parallelism-informed prior belief to obtain an optimal PC using a minimal number of NN training trials. We demonstrate that OPPA finds an optimal PC for training transformers more efficiently than the methods used in existing parallel training frameworks.
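To illustrate the idea described in the abstract, the sketch below shows a prior-weighted Bayesian optimization loop over a discrete set of parallelism configurations. This is not the authors' implementation: the GP surrogate, the hand-made prior, the acquisition weighting, and the synthetic `training_time` objective are all assumptions standing in for OPPA's actual components.

```python
# Minimal sketch (assumed, not OPPA itself): Bayesian optimization over
# parallelism configurations (PCs) with a parallelism-informed prior.
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Candidate PCs: (data, tensor, pipeline) degrees whose product equals the
# assumed GPU count.
N_GPUS = 16
candidates = np.array([
    (d, t, p)
    for d, t, p in itertools.product([1, 2, 4, 8, 16], repeat=3)
    if d * t * p == N_GPUS
], dtype=float)

def training_time(pc):
    """Synthetic stand-in for the measured step time of one training trial."""
    d, t, p = pc
    return 0.3 * np.log(d) + 0.8 * np.log(t + 1) + 0.5 * np.log(p + 1) + rng.normal(0, 0.01)

def prior_belief(pcs):
    """Hand-made 'parallelism-informed' prior favouring moderate tensor and
    pipeline degrees (an assumption, not the paper's prior)."""
    score = -np.abs(np.log2(pcs[:, 1]) - 1) - np.abs(np.log2(pcs[:, 2]) - 1)
    w = np.exp(score)
    return w / w.sum()

def gp_posterior(X, y, Xs, ls=1.0, noise=1e-4):
    """GP posterior mean/std with an RBF kernel on log2-scaled PCs."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / ls**2)
    K = k(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = k(X, Xs)
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.clip(np.diag(k(Xs, Xs)) - (v**2).sum(0), 1e-12, None)
    return mu, np.sqrt(var)

X_log = np.log2(candidates)
prior = prior_belief(candidates)

# Seed with one trial at the prior's mode, then run a few BO iterations.
observed = [int(np.argmax(prior))]
times = [training_time(candidates[observed[0]])]

for it in range(8):
    mu, sigma = gp_posterior(X_log[observed], np.array(times), X_log)
    # Lower-confidence-bound acquisition (we minimise training time), weighted
    # by the prior so early iterations focus on configurations it favours.
    acq = (mu - 2.0 * sigma) - 0.5 * np.log(prior + 1e-12) / (it + 1)
    acq[observed] = np.inf  # do not re-run already evaluated PCs
    nxt = int(np.argmin(acq))
    observed.append(nxt)
    times.append(training_time(candidates[nxt]))

best = observed[int(np.argmin(times))]
print("best PC (data, tensor, pipeline):", candidates[best], "time:", min(times))
```

The decaying prior weight mirrors the general idea of letting observed trial costs gradually override the prior belief; the specific schedule here is only illustrative.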
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 22