LEGACY: A Lightweight Adaptive Gradient Compression Strategy for Distributed Deep Learning

20 Sept 2024 (modified: 22 Jan 2025) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Adaptive Gradient Compression, Gradient Compression, Distributed Deep Learning, Federated Learning, Efficient Communication, Gradient Sparsification, Communication Overhead
TL;DR: In this work, we propose a lightweight and efficient adaptive gradient compression method that changes the compression ratio of each layer based on the layer size and the training iteration.
Abstract: Distributed learning has demonstrated remarkable success in training deep neural networks (DNNs) on large datasets, but the communication bottleneck limits its scalability. Various compression techniques have been proposed to alleviate this limitation; they often rely on computationally intensive methods to determine optimal compression parameters during training and are commonly referred to as adaptive compressors. Instead of the hard-to-tune hyperparameters of adaptive compressors, in this paper we investigate the impact of two fundamental factors in DNN training, the layer size of the DNN and the training phase, to design a simple yet efficient adaptive scheduler that guides the selection of compression parameters for any compressor. We present a **L**ightweight **E**fficient **G**r**A**dient **C**ompression strateg**Y**, or LEGACY, that, in theory, can work with any compression technique to produce its simple adaptive counterpart. We benchmark LEGACY on distributed and federated training with 6 different DNN architectures for various tasks on large and challenging datasets, including ImageNet and WikiText-103. On ImageNet training, while sending a similar average data volume, LEGACY's adaptive compression strategies improve the Top-1 accuracy of ResNet-50 by 7%-11% compared to uniform Top-0.1% compression applied throughout training. Similarly, on WikiText-103, using our layer-based adaptive compression strategy and sending a similar average data volume, the perplexity of Transformer-XL improves by $\sim$26% more than with uniform Top-0.1% compression applied throughout training. We publish anonymized code at: https://github.com/LEGACY-compression/LEGACY.
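To make the idea concrete, the sketch below shows one way a per-layer Top-k sparsifier could vary its keep ratio with layer size and training iteration, as the TL;DR describes. This is a minimal illustrative sketch only, not the authors' released implementation: the function names (`adaptive_keep_ratio`, `topk_sparsify`) and the specific size and warm-up thresholds are assumptions; the actual LEGACY schedules are defined in the paper and repository.

```python
# Illustrative sketch (assumed schedule, not the authors' code): a per-layer
# Top-k gradient sparsifier whose keep ratio depends on (i) layer size and
# (ii) training progress.
import torch


def adaptive_keep_ratio(num_params: int, step: int, total_steps: int,
                        base_ratio: float = 0.001) -> float:
    """Hypothetical scheduler: small layers and early iterations keep more gradients."""
    size_factor = 10.0 if num_params < 1_000_000 else 1.0      # assumed size rule
    phase_factor = 10.0 if step < 0.1 * total_steps else 1.0   # assumed warm-up rule
    return min(1.0, base_ratio * size_factor * phase_factor)


def topk_sparsify(grad: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep only the largest-magnitude gradient entries; zero out the rest."""
    flat = grad.flatten()
    k = max(1, int(keep_ratio * flat.numel()))
    _, idx = torch.topk(flat.abs(), k)
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.view_as(grad)


# Usage: compress each layer's gradient before it is communicated
# (e.g., all-reduced among workers or sent to a parameter server).
model = torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.Linear(128, 10))
model(torch.randn(8, 64)).sum().backward()
step, total_steps = 100, 10_000
for p in model.parameters():
    ratio = adaptive_keep_ratio(p.numel(), step, total_steps)
    p.grad = topk_sparsify(p.grad, ratio)
```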
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2186