Keywords: quantization, generalization, linear regression
TL;DR: We provide a refined analysis of the excess risk of finite-step stochastic gradient descent for high-dimensional linear regression under a comprehensive range of quantization schemes.
Abstract: Low-bit quantization has emerged as an indispensable technique for enabling the efficient training of large-scale models. Despite its widespread empirical success, a rigorous theoretical understanding of its impact on learning performance remains notably absent, even in the simplest linear regression setting. We present the first systematic theoretical study of this fundamental question, analyzing finite-step stochastic gradient descent (SGD) for high-dimensional linear regression under a comprehensive range of quantization targets: data, labels, parameters, activations, and gradients. Our novel analytical framework establishes precise algorithm-dependent and data-dependent excess risk bounds that characterize how each form of quantization affects learning: parameter, activation, and gradient quantization amplify noise during training; data quantization distorts the data spectrum; and data and label quantization introduce an additional bias error. Crucially, we prove that for multiplicative quantization, this spectral distortion can be eliminated, and that for additive quantization, a beneficial scaling effect with batch size emerges. Furthermore, under common polynomial-decay data spectrum scenarios, we quantitatively compare floating-point (FP) and integer quantization methods, identifying the settings where each is more suitable. Our theory provides a powerful lens for characterizing how quantization shapes the learning dynamics of optimization algorithms, paving the way to further explore learning theory under practical hardware constraints.
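The abstract's claim that gradient quantization amplifies noise during training can be illustrated with a toy simulation. The sketch below is not the paper's framework: it runs one-pass SGD on a synthetic linear regression problem and optionally applies unbiased stochastic rounding to each stochastic gradient; the function names, grid width `grad_step`, and problem sizes are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x, step):
    """Unbiased stochastic rounding onto a uniform grid of width `step`."""
    scaled = x / step
    lower = np.floor(scaled)
    prob_up = scaled - lower  # round up with probability equal to the fractional part
    return (lower + (rng.random(x.shape) < prob_up)) * step

def sgd_excess_risk(d=20, n_steps=2000, lr=0.01, noise=0.1, grad_step=0.0):
    """One-pass SGD for y = <w*, x> + noise; optionally quantize each gradient.

    Returns ||w - w*||^2, which equals the excess risk for isotropic Gaussian data.
    """
    w_star = np.ones(d) / np.sqrt(d)
    w = np.zeros(d)
    for _ in range(n_steps):
        x = rng.standard_normal(d)
        y = x @ w_star + noise * rng.standard_normal()
        g = (x @ w - y) * x  # stochastic gradient of the squared loss
        if grad_step > 0:
            g = stochastic_round(g, grad_step)  # unbiased, but adds variance
        w -= lr * g
    return float(np.sum((w - w_star) ** 2))
```

In this toy setting, stochastic rounding keeps the gradient unbiased, so SGD still converges in expectation, but the extra rounding variance inflates the final excess risk relative to the unquantized run, consistent with the noise-amplification effect described above.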
Primary Area: learning theory
Submission Number: 19457