Would decentralization hurt generalization?

Published: 01 Feb 2023, Last Modified: 13 Feb 2023 · Submitted to ICLR 2023 · Readers: Everyone
Keywords: decentralized SGD, flat minima, generalization, implicit regularization, large batch training
Abstract: Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on massive numbers of devices without the control of a central server. Existing theory suggests that decentralization degrades generalizability, which conflicts with experimental results in large-batch settings showing that D-SGD generalizes better than centralized SGD (C-SGD). This work presents new theory that reconciles the conflict between the two perspectives. We prove that D-SGD introduces an implicit regularization that simultaneously penalizes (1) the sharpness of the learned minima and (2) the consensus distance between the consensus model and the local models. We then prove that this implicit regularization is amplified in large-batch settings when the linear scaling rule is applied. We further analyze the escaping efficiency of D-SGD, which suggests that D-SGD favors super-quadratic flat minima. Experiments are in full agreement with our theory. The code will be released publicly. To the best of our knowledge, this is the first work on the implicit regularization and escaping efficiency of D-SGD.
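For intuition only, below is a minimal sketch of one synchronous D-SGD round and of the consensus distance that the abstract refers to. It assumes a fixed, doubly stochastic gossip matrix over a ring of four workers; all names (dsgd_step, gossip_matrix, and the toy sizes) are illustrative assumptions, not the paper's released code or its exact regularization formula.

```python
# Sketch of generic decentralized SGD: each worker gossip-averages its
# neighbors' models, then takes a local stochastic gradient step.
import numpy as np

def dsgd_step(local_params, local_grads, gossip_matrix, lr):
    """One D-SGD update: x_i <- sum_j W_ij x_j - lr * g_i for every worker i."""
    mixed = gossip_matrix @ local_params   # neighborhood (gossip) averaging
    return mixed - lr * local_grads        # local SGD step

# Toy usage: 4 workers on a ring topology, d-dimensional parameters.
n, d, lr = 4, 10, 0.1
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])   # symmetric, doubly stochastic
x = np.random.randn(n, d)                  # one row of parameters per worker
g = np.random.randn(n, d)                  # local stochastic gradients
x = dsgd_step(x, g, W, lr)

# Consensus distance: mean squared deviation of the local models from the
# consensus (average) model -- the quantity the abstract says is penalized
# alongside the sharpness of the learned minima.
consensus = x.mean(axis=0)
consensus_distance = np.mean(np.sum((x - consensus) ** 2, axis=1))
```

Setting W to the all-ones matrix divided by n recovers fully synchronized averaging, i.e. the C-SGD baseline the abstract compares against.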
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Theory (eg, control theory, learning theory, algorithmic game theory)
TL;DR: D-SGD introduces an implicit regularization that penalizes the learned minima’s sharpness.