Keywords: distributed training, simulation, runtime estimation, memory estimation, parallel training, infrastructure
TL;DR: TORCHSIM is a simulation-based tool that predicts the runtime and memory usage of distributed AI training configurations without execution, enabling fast, cost-effective, and accurate selection of training setups at scale.
Abstract: Large AI models unlock powerful applications but are costly and complex to train, primarily due to the challenge of configuring distributed training across GPU clusters. This involves selecting the right combination of techniques based on the model, data, hardware, and performance objectives. In practice, teams often rely on trial and error, leading to high compute costs, cloud spend, and wasted time, without guarantees of success or optimality. We present TORCHSIM, a simulator that eliminates this burden by accurately predicting whether a configuration will succeed (i.e., stay within memory limits) and how long it will take to run, without requiring actual execution or access to the target hardware. Users simply input candidate configurations and choose the best successful one, such as the fastest, avoiding costly and uncertain tuning. TORCHSIM combines analytical and learned models to estimate operator-level runtimes and employs a GPU execution simulator to capture the intricacies of multi-stream parallelism and hardware behavior. Evaluated on both language and vision models across A100 and H100 GPUs, up to 128-GPU scale, with multi-dimensional parallelism and interconnects like InfiniBand and RoCE, TORCHSIM achieves over 90% accuracy in runtime prediction and 99% in memory estimation. It is open-sourced as an extension to PyTorch, with results demonstrated on TORCHTITAN.
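Below is a minimal, hedged sketch of the selection workflow the abstract describes: enumerate candidate parallelism configurations, predict peak memory and step time for each without executing on GPUs, discard those predicted to exceed the memory budget, and keep the fastest. The `predict` stub, the `Candidate` fields, and all numbers are illustrative assumptions, not TORCHSIM's actual API.

```python
# Illustrative configuration selection: simulate candidates, keep those predicted
# to fit in GPU memory, and return the one with the lowest predicted step time.
# The predict() stub stands in for the simulator; it is NOT TORCHSIM's real interface.

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    tp: int            # tensor-parallel degree
    pp: int            # pipeline-parallel degree
    dp: int            # data-parallel degree
    micro_batch: int   # per-GPU micro-batch size


def predict(cfg: Candidate) -> tuple[float, float]:
    """Placeholder for the simulator: returns (peak memory in GiB, step time in s).
    In TORCHSIM these would come from operator-level runtime/memory estimates and
    the GPU execution simulator; the toy formula here only makes the example run."""
    peak_mem = 120.0 * cfg.micro_batch / (cfg.tp * cfg.pp)
    step_time = 1.0 / (cfg.tp * cfg.pp * cfg.dp) + 0.02 * cfg.micro_batch
    return peak_mem, step_time


def pick_best(candidates: list[Candidate], gpu_mem_gib: float = 80.0):
    """Return the predicted-feasible candidate with the lowest predicted step time."""
    best, best_time = None, float("inf")
    for cfg in candidates:
        peak_mem, step_time = predict(cfg)
        if peak_mem <= gpu_mem_gib and step_time < best_time:
            best, best_time = cfg, step_time
    return best, best_time


if __name__ == "__main__":
    grid = [Candidate(tp, pp, dp, mb)
            for tp in (1, 2, 4) for pp in (1, 2) for dp in (2, 4) for mb in (1, 2, 4)]
    print(pick_best(grid))
```

The point of the sketch is the decision rule, not the estimates themselves: because prediction replaces execution, the candidate grid can be swept exhaustively at negligible cost before any hardware is reserved.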
Submission Number: 135