Convergence Analysis and Trajectory Comparison of Gradient Descent for Overparameterized Deep Linear Networks

Published: 22 Jul 2024, Last Modified: 22 Jul 2024. Accepted by TMLR. License: CC BY-SA 4.0.
Abstract: This paper presents a convergence analysis and trajectory comparison of the gradient descent (GD) method for overparameterized deep linear neural networks under different random initializations, demonstrating that the GD trajectory for these networks closely matches that of the corresponding convex optimization problem. This study touches upon a major open theoretical problem in machine learning: why are deep neural networks trained with GD methods efficient in so many practical applications? While a solution to this problem remains beyond reach for general nonlinear deep neural networks, extensive effort has been invested in studying related questions for deep linear neural networks, and many interesting results have been reported to date. For example, recent results on the loss landscape show that even though the loss function of a deep linear neural network is non-convex, every local minimizer is also a global minimizer. We focus on the trajectory of GD applied to deep linear networks and demonstrate that, with appropriate initialization and sufficient width of the hidden layers, the GD trajectory closely matches that of the corresponding convex optimization problem. This result holds regardless of the depth of the network, providing insight into the efficiency of GD in the training of deep neural networks. Furthermore, we show that the GD trajectory for an overparameterized deep linear network automatically avoids bad saddle points.
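
As a rough illustration of the trajectory comparison described in the abstract, the following NumPy sketch trains an overparameterized deep linear network with GD and compares its end-to-end product matrix, step by step, against GD on the corresponding convex least-squares problem started from the same point in function space. This is not the paper's experiment: the problem sizes, the near-isometric initialization scale, and the division of the step size by the depth are all illustrative assumptions standing in for the paper's precise initialization and learning-rate conditions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, n = 5, 3, 200      # data dimensions and sample count (assumed)
depth, width = 4, 256           # deep linear net with wide hidden layers
lr, steps = 0.1, 300            # illustrative step size and iteration budget

X = rng.standard_normal((d_in, n))
Y = rng.standard_normal((d_out, d_in)) @ X      # noiseless linear targets

# Deep linear network W_L ... W_1 with near-isometric random initialization
# (an assumed stand-in for the paper's "appropriate initialization").
dims = [d_in] + [width] * (depth - 1) + [d_out]
Ws = [rng.standard_normal((dims[j + 1], dims[j])) / np.sqrt(dims[j])
      for j in range(depth)]

def end_to_end(Ws):
    """End-to-end product P = W_L ... W_1."""
    P = Ws[0]
    for Wk in Ws[1:]:
        P = Wk @ P
    return P

# Convex baseline: GD on 0.5/n * ||W X - Y||_F^2, started at the same point
# in function space as the deep network's initial end-to-end product.
W = end_to_end(Ws).copy()
convex_iterates = [W.copy()]
for _ in range(steps):
    W = W - lr * (W @ X - Y) @ X.T / n
    convex_iterates.append(W.copy())

# GD on the factored (non-convex) loss. With near-isometric layers, the
# first-order motion of the product is roughly `depth` times one convex GD
# step, so we divide the step size by depth, a heuristic stand-in for the
# paper's precise learning-rate conditions.
lr_deep = lr / depth
for t in range(steps):
    P = end_to_end(Ws)
    E = (P @ X - Y) @ X.T / n            # shared factor in every layer gradient
    grads = []
    for j in range(depth):
        below = np.eye(dims[0])          # W_{j-1} ... W_1 (identity when j = 0)
        for k in range(j):
            below = Ws[k] @ below
        above = np.eye(dims[j + 1])      # W_L ... W_{j+1} (identity at the top)
        for k in range(j + 1, depth):
            above = Ws[k] @ above
        grads.append(above.T @ E @ below.T)
    Ws = [Wj - lr_deep * g for Wj, g in zip(Ws, grads)]
    if (t + 1) % 50 == 0:
        gap = np.linalg.norm(end_to_end(Ws) - convex_iterates[t + 1])
        print(f"step {t + 1:4d}: ||P_t - W_t||_F = {gap:.4f}")
```

Under these settings one would expect the printed gap to stay small along the run, informally mirroring the trajectory match the paper proves; shrinking the hidden width or changing the initialization scale should degrade the agreement, consistent with the paper's requirement of appropriate initialization and sufficiently wide hidden layers.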
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:

1. A paragraph discussing that “this paper focuses on deep linear networks and the results may not be directly applicable to practical deep non-linear networks” has been added on page 12.

2. Additional discussion on the overparameterization conditions related to \(C_0\) has been included in Remark 4.

3. The discussion about equation (8) has been clarified.

4. The definition of \(\mathcal{R}(X)\) has been added on page 8. The notation for the convergence region has been changed to reduce confusion.
Assigned Action Editor: ~Ikko_Yamane1
Submission Number: 2455