ParaGAN: A Cloud Training Framework for Generative Adversarial Networks

Published: 16 May 2023, Last Modified: 15 Jun 2023
ASSYST Oral
Readers: Everyone
Keywords: Generative Adversarial Network, Distributed Machine Learning
TL;DR: We present ParaGAN, a cloud training framework for GANs that demonstrates near-optimal scaling performance on BigGAN.
Abstract: Generative Adversarial Networks (GANs) have shown tremendous success in synthesizing realistic photos and videos in recent years. However, training a GAN to convergence remains challenging: it requires significant computing power and is subject to training instability. To address these challenges, we propose ParaGAN, a cloud training framework for GANs optimized from both the system and numerical perspectives. On the system side, ParaGAN implements a congestion-aware pipeline for latency hiding, hardware-aware layout transformation for improved accelerator utilization, and an asynchronous update scheme. On the numerical side, we introduce an asymmetric optimization policy to stabilize training. Our preliminary experiments show that ParaGAN reduces the training time of BigGAN from 15 days to 14 hours on 1024 TPUs, achieving 91% scaling efficiency. Moreover, we demonstrate that ParaGAN enables the generation of unprecedented high-resolution ($1024\times1024$) images with BigGAN.
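The abstract does not spell out the asymmetric optimization policy, so the sketch below is only an illustrative assumption of one common form such a policy can take: giving the discriminator and generator different learning rates and update frequencies. All names and hyperparameters here (D_STEPS_PER_G_STEP, LR_G, LR_D, the toy networks and data) are hypothetical and not taken from ParaGAN.

```python
# Minimal sketch of an asymmetric GAN update schedule (illustrative, not ParaGAN's code).
import torch
import torch.nn as nn

LATENT_DIM, DATA_DIM = 16, 32
D_STEPS_PER_G_STEP = 2           # assumed: update the discriminator more often than the generator
LR_G, LR_D = 1e-4, 4e-4          # assumed: TTUR-style asymmetric learning rates

G = nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.ReLU(), nn.Linear(64, DATA_DIM))
D = nn.Sequential(nn.Linear(DATA_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=LR_G, betas=(0.0, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=LR_D, betas=(0.0, 0.999))
bce = nn.BCEWithLogitsLoss()

def real_batch(n=64):
    # Stand-in for real data: samples from a fixed Gaussian.
    return torch.randn(n, DATA_DIM) * 0.5 + 1.0

for step in range(200):
    # Discriminator updates: run D_STEPS_PER_G_STEP times per generator step.
    for _ in range(D_STEPS_PER_G_STEP):
        x_real = real_batch()
        x_fake = G(torch.randn(x_real.size(0), LATENT_DIM)).detach()
        loss_d = bce(D(x_real), torch.ones(x_real.size(0), 1)) + \
                 bce(D(x_fake), torch.zeros(x_fake.size(0), 1))
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()

    # Single generator update with its own (smaller) learning rate.
    z = torch.randn(64, LATENT_DIM)
    loss_g = bce(D(G(z)), torch.ones(64, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```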
Workshop Track: MLArchSys
Presentation: In-Person
Presenter Full Name: Ziji Shi
Presenter Email: zijishi@comp.nus.edu.sg
Presenter Bio: Ziji Shi is a third-year Ph.D. student at the National University of Singapore. His research interests lie in distributed machine learning systems and high-performance computing.