Keywords: PyTorch, LLM, diffusion models, distributed training, torch.compile, FSDP, Tensor Parallel, Pipeline Parallel, Context Parallel, Expert Parallel, large-scale training
Abstract: TorchTitan is a PyTorch-native open-source platform (GitHub: https://github.com/pytorch/torchtitan) designed for scalable and flexible training of generative AI models. Tightly integrated with PyTorch's distributed stack while offering efficient optimizations and modular configurations, TorchTitan showcases elastic training of LLMs with composable 4-D parallelism. Moreover, TorchTitan supports extensible abstractions for experimenting with new model architectures (e.g., diffusion models) or infrastructure techniques (e.g., a compiler-first FSDP implementation), while biasing towards a clean, minimal codebase. This paper presents the motivation, system architecture, and demonstrated impact of TorchTitan, underscoring its alignment with the CODEML mission to advance open, sustainable machine learning development.
Submission Number: 44