Layer-wise Quantization for Quantized Optimistic Dual Averaging

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This paper studies adaptive layer-wise compression and optimistic dual averaging for distributed variational inequalities.
Abstract: Modern deep neural networks exhibit heterogeneity across their many layers of various types, such as residual and multi-head attention layers, owing to varying structures (dimensions, activation functions, etc.) and distinct representation characteristics, which affect predictions. We develop a general layer-wise quantization framework with tight variance and code-length bounds that adapts to these heterogeneities over the course of training. We then apply this new layer-wise quantization technique within distributed variational inequalities (VIs), proposing a novel Quantized Optimistic Dual Averaging (QODA) algorithm with adaptive learning rates that achieves competitive convergence rates for monotone VIs. We empirically show that QODA achieves up to a $150$% speedup over the baselines in end-to-end training time when training a Wasserstein GAN on $12+$ GPUs.
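
As a rough illustration of the layer-wise idea described in the abstract, the sketch below assigns each layer its own number of quantization levels via a simple norm-based rule and applies unbiased stochastic quantization per layer. The helper names (`choose_levels`, `quantize_layer`, `quantize_model_gradients`) and the level-selection rule are hypothetical stand-ins, not the paper's framework or its variance and code-length bounds.

```python
# Minimal sketch of adaptive layer-wise quantization (illustrative only).
import numpy as np


def choose_levels(grad: np.ndarray, min_levels: int = 4, max_levels: int = 256) -> int:
    """Assign more quantization levels to layers with larger normalized gradient norms
    (hypothetical rule, not the paper's adaptive scheme)."""
    scale = np.linalg.norm(grad) / (np.sqrt(grad.size) + 1e-12)
    levels = int(min_levels + (max_levels - min_levels) * min(scale, 1.0))
    return max(min_levels, levels)


def quantize_layer(grad: np.ndarray, levels: int, rng: np.random.Generator) -> np.ndarray:
    """Unbiased stochastic uniform quantization of one layer's gradient."""
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return np.zeros_like(grad)
    scaled = np.abs(grad) / norm * (levels - 1)      # normalized magnitudes on the level grid
    lower = np.floor(scaled)
    prob_up = scaled - lower                         # randomized rounding keeps the estimate unbiased
    quantized = (lower + (rng.random(grad.shape) < prob_up)) / (levels - 1)
    return np.sign(grad) * norm * quantized


def quantize_model_gradients(grads_by_layer: dict[str, np.ndarray],
                             rng: np.random.Generator) -> dict[str, np.ndarray]:
    """Quantize each layer with its own number of levels before communication."""
    return {name: quantize_layer(g, choose_levels(g), rng)
            for name, g in grads_by_layer.items()}
```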
Lay Summary: Training advanced AI models across many computers often stalls because of the huge amount of information that must be exchanged. We introduce a technique that squeezes the data shared during training by assigning different compression levels to each layer based on its importance. Key layers receive more precision, while others are represented with fewer bits. We integrate this into a training algorithm called Quantized Optimistic Dual Averaging (QODA), which works seamlessly with compressed data and skips extra synchronization steps. We rigorously prove that, despite the reduced communication, our method converges as reliably as standard uncompressed training. In experiments on image generation and large language models across dozens of GPUs, our approach more than doubles training speed while matching final accuracy. By cutting communication costs and speeding up each training round, our work makes distributed deep learning faster, more scalable, and energy-efficient.
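
For intuition about the single-call structure of an optimistic dual-averaging loop with compressed operator feedback, here is a minimal single-worker sketch on a small bilinear saddle-point problem. The `quantize` helper is a generic stand-in for the layer-wise compression, and the constant learning rate and update form are illustrative assumptions; the actual QODA updates, adaptive step sizes, and distributed communication are specified in the paper.

```python
# Schematic single-worker optimistic dual-averaging loop with a stand-in quantizer.
import numpy as np

rng = np.random.default_rng(0)


def operator(z: np.ndarray) -> np.ndarray:
    """Monotone operator of the bilinear game min_x max_y x*y; the solution is the origin."""
    x, y = z[:1], z[1:]
    return np.concatenate([y, -x])


def quantize(g: np.ndarray, levels: int = 16) -> np.ndarray:
    """Generic unbiased stochastic quantizer (stand-in for layer-wise compression)."""
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return g
    scaled = np.abs(g) / norm * (levels - 1)
    lower = np.floor(scaled)
    q = (lower + (rng.random(g.shape) < scaled - lower)) / (levels - 1)
    return np.sign(g) * norm * q


z_anchor = np.array([1.0, 1.0])   # dual averaging anchors every update at this point
s = np.zeros_like(z_anchor)       # running sum of (quantized) operator values
g_prev = np.zeros_like(z_anchor)  # last operator value, reused for the optimistic half-step
lr = 0.1                          # constant step size, for illustration only

for _ in range(3000):
    z_half = z_anchor - lr * (s + g_prev)   # optimistic half-step reuses the previous value
    g_prev = quantize(operator(z_half))     # single (compressed) operator call per iteration
    s += g_prev                             # dual-averaging accumulation

print("final half-step iterate:", z_half)   # drifts toward the solution (0, 0)
```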
Primary Area: Optimization->Large Scale, Parallel and Distributed
Keywords: Adaptive Compression, Layer-wise Compression, Optimistic Dual Averaging, Distributed Variational Inequality
Submission Number: 11316