Keywords: PyTorch, LLM, diffusion models, distributed training, torch.compile, FSDP, Tensor Parallel, Pipeline Parallel, Context Parallel, Expert Parallel, large-scale training
Abstract: TorchTitan is a PyTorch-native open-source platform (GitHub: https://github.com/pytorch/torchtitan) designed for scalable and flexible training of generative AI models. Tightly integrated with PyTorch's distributed stack while offering efficient optimizations and modular configurations, TorchTitan showcases elastic training of LLMs with composable 4-D parallelism. Moreover, TorchTitan supports extensible abstractions for experimenting with new model architectures (e.g., diffusion models) or infrastructure techniques (e.g., a compiler-first FSDP implementation), while biasing towards a clean, minimal codebase. This paper presents the motivation, system architecture, and demonstrated impact of TorchTitan, underscoring its alignment with the CODEML mission to advance open, sustainable machine learning development.
Submission Number: 44