Scaling Infrastructure to Support Multi-Trillion Parameter LLM Training

Published: 16 May 2023, Last Modified: 15 Jun 2023, ASSYST Oral, Readers: Everyone
Keywords: Large Language Models, multi-trillion parameter models, LLM scaling, LLM training, system co-design, tensor offloading, system efficiency
TL;DR: We show how to scale LLM training to 128T parameter models, 16,384 GPUs, with 75%+ MFU
Abstract: This paper discusses efficient system designs for scaling Large Language Models (LLMs) to up to 128 trillion parameters. We use a comprehensive analytical performance model to analyze how such models could be trained on current systems while maintaining 75% model FLOPS utilization (MFU). We first show how tensor offloading can be used to dramatically increase the size of trainable LLMs while keeping other system constants similar. We analyze performance bottlenecks when scaling to systems of up to 16,384 GPUs and models of up to 128T parameters. Our findings suggest that current H100 GPUs with 80 GiB of HBM, augmented with 512 GiB of tensor offloading capacity, allow scaling to 11T-parameter LLMs, and that reaching 128T parameters requires 120 GiB of HBM and 2 TiB of offloading memory, yielding 75%+ MFU, which is uncommon even when training much smaller LLMs today.
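To make the quantities in the abstract concrete, the following is a minimal back-of-the-envelope sketch, not the paper's analytical performance model. It computes the aggregate memory budget per parameter for a given GPU count, HBM size, and offloading capacity, and it computes MFU from training throughput. Two things here are assumptions rather than claims from the abstract: the function names are hypothetical, and MFU uses the common approximation of 6 FLOPs per parameter per token for dense transformer training. The peak FLOPS per GPU is left as an input and should be taken from the hardware datasheet.

```python
GIB = 2**30  # bytes in one GiB


def bytes_per_param_budget(n_params, n_gpus, hbm_gib, offload_gib):
    """Aggregate memory (HBM + offload) across all GPUs, divided by parameter count.

    This is only a capacity budget. It says nothing about how weights,
    optimizer state, and activations are actually placed.
    """
    total_bytes = n_gpus * (hbm_gib + offload_gib) * GIB
    return total_bytes / n_params


def mfu(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu):
    """Model FLOPS utilization: achieved model FLOPS over peak system FLOPS.

    Assumption: training costs about 6 * n_params FLOPs per token
    (forward plus backward pass, attention FLOPs ignored).
    """
    achieved_flops = 6 * n_params * tokens_per_second
    return achieved_flops / (n_gpus * peak_flops_per_gpu)


def tokens_per_second_for_mfu(target_mfu, n_params, n_gpus, peak_flops_per_gpu):
    """Throughput needed to reach a target MFU under the same 6N assumption."""
    return target_mfu * n_gpus * peak_flops_per_gpu / (6 * n_params)


if __name__ == "__main__":
    # Assumption: both configurations are evaluated at 16,384 GPUs;
    # the abstract does not tie each model size to a specific GPU count.
    for n_params, hbm, offload in [(11e12, 80, 512), (128e12, 120, 2048)]:
        budget = bytes_per_param_budget(n_params, 16_384, hbm, offload)
        print(f"{n_params:.0e} params: {budget:.1f} bytes of capacity per parameter")

    # Placeholder peak of 1e15 FLOPS per GPU, a round number, not a datasheet value.
    tps = tokens_per_second_for_mfu(0.75, 128e12, 16_384, 1e15)
    print(f"tokens/s needed for 75% MFU at 128T params: {tps:.3e}")
```

Under these assumptions, a larger offloading capacity raises the bytes available per parameter without changing the GPU count, which is the lever the abstract describes.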
Workshop Track: ASSYST
Presentation: In-Person
Presenter Full Name: Mikhail Isaev
Presenter Email:
Presenter Bio: Mikhail Isaev is a final-year Ph.D. student at the School of Computational Science and Engineering, Georgia Institute of Technology. His research interests lie at the intersection of computer architecture, high-performance computing, and deep learning. His work focuses on deep learning workload analysis and software-hardware co-design of large-scale deep learning systems. He received the Dr. Sudhakar Yalamanchili Award for his contribution to the field of computer modeling and simulation at ModSim’22.