Adaptive Compression for Communication-Efficient Distributed Training
Abstract: We propose Adaptive Compressed Gradient Descent (AdaCGD) -- a novel optimization algorithm for communication-efficient training of supervised machine learning models with adaptive compression level. Our approach is inspired by the recently proposed three point compressor (3PC) framework of Richtarik et al. (2022), which includes error feedback (EF21), lazily aggregated gradient (LAG), and their combination as special cases, and offers the current state-of-the-art rates for these methods under weak assumptions. While the above mechanisms offer a fixed compression level or adapt between two extreme compression levels, we propose a much finer adaptation. In particular, we allow users to choose between selected contractive compression mechanisms, such as Top-$K$ sparsification with a user-defined selection of sparsification levels $K$, or quantization with a user-defined selection of quantization levels, or their combination. AdaCGD chooses the appropriate compressor and compression level adaptively during the optimization process. Besides i) proposing a theoretically grounded multi-adaptive communication compression mechanism, we further ii) extend the 3PC framework to bidirectional compression, i.e., allow the server to compress as well, and iii) provide sharp convergence bounds in the strongly convex, convex, and nonconvex settings. The convex regime results are new even for several key special cases of our general mechanism, including 3PC and EF21. In all regimes, our rates are superior to those of all existing adaptive compression methods.
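To make the notion of a contractive compressor with a selectable level concrete, the following is a minimal sketch of Top-$K$ sparsification in plain Python. It is not the AdaCGD selection rule, which the abstract does not spell out; the function names `top_k`, `sq_norm`, and `compression_error` and the example vector are assumptions made for illustration. The sketch only checks the standard contraction property $\|C(x) - x\|^2 \le (1 - K/d)\,\|x\|^2$ for several levels $K$.

```python
def top_k(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    if not 0 <= k <= len(x):
        raise ValueError("k must lie between 0 and len(x)")
    # Indices sorted by decreasing absolute value; the first k are kept.
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def sq_norm(x):
    """Squared Euclidean norm of a list of floats."""
    return sum(v * v for v in x)


def compression_error(x, k):
    """Squared distance between x and its Top-k compression."""
    cx = top_k(x, k)
    return sq_norm([a - b for a, b in zip(x, cx)])


# Hypothetical gradient vector, used only for illustration.
x = [3.0, -1.0, 0.5, -4.0, 2.0]
d = len(x)

# A user-defined selection of sparsification levels K (illustrative values).
for k in (1, 2, 4):
    err = compression_error(x, k)
    bound = (1 - k / d) * sq_norm(x)
    print(f"K={k}: error={err:.2f}, contraction bound={bound:.2f}")
```

Larger $K$ sends more coordinates and leaves a smaller compression error, which is the trade-off an adaptive scheme has to balance against communication cost.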
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We modified the Motivation and Background section, adding the references suggested by reviewer NXKS. We updated Table 1 with 3 additional comparisons. Following reviewer xFzN's suggestion, we enlarged the experiments section with an additional figure, more detail, and a discussion of how adaptivity is utilized, both intuitively and in the experiments. We also provided more experimental details and added an experiment on the phishing dataset in the convex regime to the appendix. Following the editor's suggestion, we revised the text to state clearly that in some experiments CLAG shows slightly superior experimental convergence rates.
Supplementary Material: zip
Assigned Action Editor: ~marco_cuturi2
Submission Number: 691