Keywords: deep learning, differential privacy, distributed learning, system design
TL;DR: We develop state-of-the-art distributed learning (ZeRO, DeepSpeed) with differential privacy.
Abstract: Deep learning with large models has achieved remarkable success in a wide range of domains, but optimizing billions of parameters is challenging in terms of training speed, memory cost, and communication efficiency, especially under the differential privacy (DP) regime. On the one hand, DP optimization on a single device is comparably efficient to standard non-private optimization, but existing DP distributed learning (such as data/pipeline parallelism) has significant efficiency limitations. On the other hand, the Zero Redundancy Optimizer (ZeRO) is a state-of-the-art solution for optimizing memory and improving training efficiency on large models under the standard regime, but it faces technical challenges in working compatibly with DP. In this work, we develop a new systematic solution, DP-ZeRO, to scale up the model size and achieve almost the same computation and communication efficiency as standard distributed learning, in both full and mixed precision. Our DP-ZeRO, like the standard ZeRO, has the potential to train models of arbitrary size and is evaluated on DP models with the world's largest number of trainable parameters.
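As background for the DP optimization the abstract refers to, the sketch below shows a minimal DP-SGD step (per-sample gradient clipping plus Gaussian noise) in plain PyTorch. It is not the paper's DP-ZeRO implementation; the toy model and the hyperparameters `clip_norm` and `noise_multiplier` are illustrative assumptions, and per-sample gradients are obtained by naive micro-batching rather than any optimized or sharded mechanism.

```python
# Minimal sketch of a DP-SGD step (per-sample clipping + Gaussian noise).
# Illustrative only; NOT the paper's DP-ZeRO implementation.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # toy model standing in for a large model (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

clip_norm = 1.0          # per-sample gradient clipping threshold C (assumption)
noise_multiplier = 1.0   # Gaussian noise scale sigma, relative to C (assumption)

def dp_sgd_step(xb, yb):
    """One DP-SGD update: clip each per-sample gradient, sum, add noise, then step."""
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(xb, yb):                   # per-sample gradients via micro-batching
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (total_norm + 1e-6), max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * scale)                  # accumulate the clipped per-sample gradient
    batch_size = xb.shape[0]
    for p, s in zip(model.parameters(), summed):
        noise = torch.randn_like(s) * noise_multiplier * clip_norm
        p.grad = (s + noise) / batch_size      # noisy average gradient
    optimizer.step()

# Usage on a random batch
xb, yb = torch.randn(8, 10), torch.randn(8, 1)
dp_sgd_step(xb, yb)
```

The point of the sketch is that the extra per-sample work (clipping each example's gradient before aggregation) is what standard distributed schemes such as ZeRO do not account for, which is the compatibility gap the paper addresses.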