ParaGAN: A Cloud Training Framework for Generative Adversarial Networks

Published: 16 May 2023, Last Modified: 15 Jun 2023
ASSYST Oral
Readers: Everyone
Keywords: Generative Adversarial Network, Distributed Machine Learning
TL;DR: We present ParaGAN, a cloud training framework for GANs that demonstrates near-optimal scaling performance on BigGAN.
Abstract: Generative Adversarial Networks (GANs) have shown tremendous success in synthesizing realistic photos and videos in recent years. However, training a GAN to convergence remains challenging: it requires significant computing power and is subject to training instability. To address these challenges, we propose ParaGAN, a cloud training framework for GANs optimized from both the system and numerical perspectives. On the system side, ParaGAN implements a congestion-aware pipeline for latency hiding, hardware-aware layout transformation for improved accelerator utilization, and an asynchronous update scheme. On the numerical side, we introduce an asymmetric optimization policy to stabilize training. Our preliminary experiments show that ParaGAN reduces the training time of BigGAN from 15 days to 14 hours on 1024 TPUs, achieving 91% scaling efficiency. Moreover, we demonstrate that ParaGAN enables the generation of unprecedented high-resolution ($1024\times1024$) images with BigGAN.
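The abstract does not spell out the asymmetric optimization policy, so the sketch below is only an illustrative assumption of one common form such a policy can take: giving the discriminator and generator different learning rates and update frequencies. All names and hyperparameters here (D_STEPS_PER_G_STEP, LR_G, LR_D, the toy networks and data) are hypothetical and not taken from ParaGAN.

```python
# Minimal sketch of an asymmetric GAN update schedule (illustrative, not ParaGAN's code).
import torch
import torch.nn as nn

LATENT_DIM, DATA_DIM = 16, 32
D_STEPS_PER_G_STEP = 2           # assumed: update the discriminator more often than the generator
LR_G, LR_D = 1e-4, 4e-4          # assumed: TTUR-style asymmetric learning rates

G = nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.ReLU(), nn.Linear(64, DATA_DIM))
D = nn.Sequential(nn.Linear(DATA_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=LR_G, betas=(0.0, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=LR_D, betas=(0.0, 0.999))
bce = nn.BCEWithLogitsLoss()

def real_batch(n=64):
    # Stand-in for real data: samples from a fixed Gaussian.
    return torch.randn(n, DATA_DIM) * 0.5 + 1.0

for step in range(200):
    # Discriminator updates: run D_STEPS_PER_G_STEP times per generator step.
    for _ in range(D_STEPS_PER_G_STEP):
        x_real = real_batch()
        x_fake = G(torch.randn(x_real.size(0), LATENT_DIM)).detach()
        loss_d = bce(D(x_real), torch.ones(x_real.size(0), 1)) + \
                 bce(D(x_fake), torch.zeros(x_fake.size(0), 1))
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()

    # Single generator update with its own (smaller) learning rate.
    z = torch.randn(64, LATENT_DIM)
    loss_g = bce(D(G(z)), torch.ones(64, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```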
Workshop Track: MLArchSys
Presentation: In-Person
Presenter Full Name: Ziji Shi
Presenter Email: zijishi@comp.nus.edu.sg
Presenter Bio: Ziji Shi is a third-year Ph.D. student at the National University of Singapore. His research interests lie in distributed machine learning systems and high-performance computing.