Keywords: foundation model training, distributed systems, automatic parallelism
Abstract: The rapid scaling of large language models (LLMs) has made parallel configuration tuning a central challenge. Most existing frameworks rely on labor-intensive manual tuning. Recent approaches attempt to automate this process and reduce reliance on expert intervention, but they often depend on highly accurate cost models. In practice, such models frequently fall short because exact modeling is difficult, leading to suboptimal configurations. To address this limitation, this work introduces \textit{FlexParallel}, a framework that integrates an uncertainty-aware grey-box cost surrogate, a sample-efficient parallelism explorer, and an adaptive stopping criterion to automatically discover high-performance parallelism configurations. We evaluate FlexParallel through extensive experiments spanning diverse model architectures, parameter scales, sequence lengths, and cluster sizes. To the best of our knowledge, this work presents the first empirical evaluation of an automatic parallelism tuner on a cluster of up to 8,192 devices. Experimental results show that, with a limited number of exploration steps, FlexParallel achieves an average speedup of 1.06$\times$ over manual expert tuning, and up to 1.12$\times$ in the best case.
Supplementary Material: pdf
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 24012