- Keywords: differential privacy, differentially private SGD, privacy-preserving training
- Abstract: A large number of recent studies reveal that networks and their optimization updates contain information about potentially private training data. To protect sensitive training data, differential privacy has been adopted in deep learning to provide rigorously defined and measurable privacy. However, differentially private stochastic gradient descent (DP-SGD) requires the injection of an amount of noise that scales with the number of gradient dimensions, while neural networks typically contain millions of parameters. As a result, networks trained with DP-SGD typically suffer large performance drops compared to non-private training. Recent works propose to first project gradients into a lower-dimensional subspace, found via the power method, and then inject noise in this subspace. Although better performance has been achieved, the use of the power method significantly increases the memory footprint, since sample gradients must be stored, and adds computational cost for the projection. In this work, we mitigate these disadvantages through a sparse gradient representation. Specifically, we randomly freeze a progressively increasing subset of parameters, which results in sparse gradient updates while maintaining or increasing accuracy over differentially private baselines. Our experiments show that we can reduce the gradient dimensionality by up to 40\% while achieving the same performance within the same number of training epochs. Additionally, the sparsity of the gradient updates reduces communication overhead when deployed in collaborative training, e.g., federated learning. When we apply our approach across various DP-SGD frameworks, we maintain accuracy while achieving up to 70\% representation sparsity, which shows that our approach is a safe and effective add-on to a variety of methods. We further observe that our approach improves accuracy, in particular for large networks.
Importantly, the additional computational cost of our approach is negligible; in fact, overall training computation decreases because the power-method iterations become cheaper on sparse gradients.
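The core mechanism described above, randomly freezing a progressively growing subset of parameters and running a DP-SGD-style update on the remaining coordinates, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear freezing schedule, the `final_sparsity` parameter, and the function names are assumptions made for the example.

```python
import numpy as np

def freeze_mask(dim, epoch, total_epochs, final_sparsity=0.7, rng=None):
    """Randomly freeze a progressively growing subset of parameters.

    Returns a boolean mask where True marks a trainable coordinate.
    The frozen fraction ramps linearly from 0 to `final_sparsity`
    (a hypothetical schedule chosen for illustration).
    """
    rng = rng or np.random.default_rng(0)
    frozen_frac = final_sparsity * epoch / max(total_epochs - 1, 1)
    n_frozen = int(frozen_frac * dim)
    mask = np.ones(dim, dtype=bool)
    frozen_idx = rng.choice(dim, size=n_frozen, replace=False)
    mask[frozen_idx] = False
    return mask

def dp_sgd_step(params, grad, mask, lr=0.1, clip=1.0, sigma=1.0, rng=None):
    """One DP-SGD-style update restricted to the unfrozen coordinates."""
    rng = rng or np.random.default_rng(1)
    g = grad * mask                       # sparse gradient: frozen coords zeroed
    norm = np.linalg.norm(g)
    g = g / max(1.0, norm / clip)         # clip to bound per-sample sensitivity
    noise = rng.normal(0.0, sigma * clip, size=g.shape) * mask
    return params - lr * (g + noise)      # noise injected only in the sparse subspace
```

Because both the clipped gradient and the noise are masked, frozen coordinates are left untouched by the update, which is what yields the sparse (and hence cheaper to project and communicate) gradient representation.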