Keywords: distributed training, scaling, pretraining, hardware-software, parallelization, efficiency, utilization, training, performance.
TL;DR: Communication overhead increasingly dominates neural network training as computation speed and the number of devices used for training increase.
Abstract: Dramatic increases in the capabilities of neural network models in recent years
are driven by scaling model size, training data, and corresponding computational
resources. To develop the exceedingly large networks required in modern applications, such as large language models (LLMs), model training is distributed
across tens of thousands of hardware accelerators (e.g. GPUs), requiring orchestration of computation and communication across large computing clusters. In this
work, we demonstrate that careful consideration of hardware configuration and
parallelization strategy is critical for effective (i.e. compute- and cost-efficient)
scaling of model size, training data, and total computation. We conduct an extensive empirical study of the performance of large-scale LLM training workloads
across model size, hardware configurations, and distributed parallelization strategies. We demonstrate that: (1) beyond certain scales, overhead incurred from
certain distributed communication strategies leads parallelization strategies previously thought to be sub-optimal to become preferable; and (2) scaling the
total number of accelerators for large model training quickly yields diminishing
returns even when hardware and parallelization strategies are properly optimized,
implying poor marginal performance per additional unit of power or GPU-hour.
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 3979