Faster Gradient Descent in Deep Linear Networks: The Advantage of Depth

Chandra Shekar Lakshminarayanan; Archish S; Arun Rajkumar; Harish Guruprasad Ramaswamy

27 Sept 2024 (modified: 10 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: Deep Linear Network; Gradient Descent; Faster Convergence in Finite Time

TL;DR: Prior works have shown that depth plays a negative role in convergence in deep linear networks. We show that depth in fact speeds up convergence.

Abstract: Gradient descent dynamics in deep linear networks have been studied under a wide range of settings. These studies report several negative results on the role of depth: gradient descent in deep linear networks (i) can take an exponential number of iterations to converge, (ii) can exhibit sigmoidal learning, i.e., almost no learning in an initial phase followed by rapid learning, and (iii) can converge more slowly as depth increases. Some of these results also rely on stronger assumptions such as whitened data and balanced initialisation. Taken together, these prior works suggest that depth hurts the speed of convergence. In this paper, we argue that the negative role of depth in prior works is due to certain pitfalls that can be avoided with care. We give a positive message on the role of depth: seen as an additional resource, depth can always be used to speed up convergence. For this purpose, we consider scalar regression with quadratic loss. In this setting, we propose a novel aligned gradient descent (AGD) algorithm for which we show that (i) linear convergence is always possible and (ii) depth accelerates convergence. In AGD, feature alignment happens in the first layer, and the deeper layers accelerate convergence by learning the right scale. We show that acceleration in AGD happens in finite time for unwhitened data. We provide insights into the acceleration mechanism and show that acceleration happens in phases. We also demonstrate the acceleration due to AGD on synthetic and benchmark datasets. Our main message is not to propose AGD as a new algorithm in itself, but to demonstrate that depth is an advantage in linear networks, thereby dispelling some of the past negative results on the role of depth.
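To make the setting concrete, here is a minimal sketch of scalar regression with quadratic loss in a deep linear network, trained with plain gradient descent. It is not the AGD algorithm, which the abstract does not specify. The parameterisation (a first-layer vector `w` followed by `depth - 1` scalar layers `scales`), the function names, the step size, and the initialisation are all illustrative assumptions rather than the authors' choices.

```python
import numpy as np

def predict(X, w, scales):
    # Deep linear model (assumed form): f(x) = (product of scalar layers) * <w, x>.
    return (X @ w) * np.prod(scales)

def quadratic_loss(X, y, w, scales):
    # Quadratic loss: half the mean squared residual.
    r = predict(X, w, scales) - y
    return 0.5 * np.mean(r ** 2)

def gd_step(X, y, w, scales, lr):
    # One step of plain gradient descent on all layers (not AGD).
    n = X.shape[0]
    s = np.prod(scales)          # combined scale of the deeper layers
    z = X @ w                    # first-layer output
    r = s * z - y                # residuals
    grad_w = s * (X.T @ r) / n
    # d(loss)/d(scales[i]) = (product of the other scalars) * mean(r * z)
    grad_scales = np.array([
        np.prod(np.delete(scales, i)) * (r @ z) / n
        for i in range(len(scales))
    ])
    return w - lr * grad_w, scales - lr * grad_scales

def train(X, y, depth, lr=0.02, steps=2000, seed=0):
    # Hypothetical initialisation: small random first layer, unit scalar layers.
    rng = np.random.default_rng(seed)
    w = 0.1 * rng.standard_normal(X.shape[1])
    scales = np.ones(depth - 1)
    losses = [quadratic_loss(X, y, w, scales)]
    for _ in range(steps):
        w, scales = gd_step(X, y, w, scales, lr)
        losses.append(quadratic_loss(X, y, w, scales))
    return w, scales, np.array(losses)
```

Calling `train(X, y, depth=3)` on a design matrix `X` and target vector `y` returns the learned parameters and the loss trajectory, which can be inspected to see how the first layer and the scalar layers evolve under plain gradient descent in this toy parameterisation.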

Primary Area: optimization

Submission Number: 11373
