Abstract: The stochastic heavy ball method (SHB), also known as stochastic gradient descent (SGD) with Polyak's momentum, is widely used in training neural networks. However, despite the remarkable success of this algorithm in practice, its theoretical characterization remains limited. In this paper, we focus on neural networks with two and three layers and provide a rigorous understanding of the properties of the solutions found by SHB: \emph{(i)} stability after dropping out part of the neurons, \emph{(ii)} connectivity along a low-loss path, and \emph{(iii)} convergence to the global optimum.
To achieve this goal, we take a mean-field view and relate the SHB dynamics to a certain partial differential equation in the limit of large network widths. This mean-field perspective has inspired a recent line of work focusing on SGD; in contrast, our paper considers an algorithm with momentum. More specifically, after proving existence and uniqueness of the limit differential equations, we show convergence to the global optimum and give a quantitative bound between the mean-field limit and the SHB dynamics of a finite-width network. Armed with this last bound, we are able to establish the dropout stability and connectivity of SHB solutions.
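As a minimal, purely illustrative sketch of the algorithm the abstract refers to (not the paper's setup), SHB augments SGD with a velocity term that accumulates past stochastic gradients. Below it trains a toy two-layer network under mean-field-style $1/N$ scaling; all numerical choices (width `N`, step size `eta`, momentum `beta`, synthetic data) are assumptions for the sake of the example.

```python
import numpy as np

# Illustrative sketch: stochastic heavy ball (SHB), i.e. SGD with
# Polyak momentum, on a toy two-layer network
#   f(x) = (1/N) * sum_i a_i * tanh(w_i . x),
# fit to synthetic data. All hyperparameters are assumed, not the paper's.

rng = np.random.default_rng(0)
N, d = 64, 5                                  # width and input dimension (assumed)
X = rng.normal(size=(256, d))
y = np.tanh(X @ rng.normal(size=d))           # synthetic targets

a = rng.normal(size=N)
W = rng.normal(size=(N, d))
va, vW = np.zeros_like(a), np.zeros_like(W)   # momentum (velocity) buffers
eta, beta = 0.1, 0.9                          # step size and momentum (assumed)

def loss(a, W):
    pred = np.tanh(X @ W.T) @ a / N
    return 0.5 * np.mean((pred - y) ** 2)

loss_before = loss(a, W)
for step in range(500):
    idx = rng.integers(0, len(X), size=32)    # mini-batch sampling (stochastic part)
    Xb, yb = X[idx], y[idx]
    h = np.tanh(Xb @ W.T)                     # hidden activations, shape (B, N)
    pred = h @ a / N
    err = (pred - yb) / len(idx)              # dL/dpred averaged over the batch
    ga = h.T @ err / N                        # gradient w.r.t. output weights a
    gW = ((err[:, None] * (a * (1 - h**2))).T @ Xb) / N  # gradient w.r.t. W
    # Heavy-ball updates: the velocity is a geometric average of past gradients.
    va = beta * va - eta * ga
    vW = beta * vW - eta * gW
    a, W = a + va, W + vW
loss_after = loss(a, W)
```

With `beta = 0`, the loop reduces to plain mini-batch SGD, which is the comparison point the abstract draws.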
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We summarize the main changes in the revision:
(1) We have added Section 8, which contains numerical results. This addresses a comment by Reviewer Vu68.
(2) In Section 2, page 3, we have added a paragraph to discuss the mean-field limit for different parameterizations, including the “maximal-update” one. This addresses a comment by Reviewer LtDr.
(3) We have added some clarifications around equations (3) and (4) in Section 3.2, page 4. This addresses a comment by Reviewer Vu68.
(4) As suggested by Reviewer Vu68, we have added the domain and co-domain of each function(al) defined in equation (11), in the line below it, on page 6.
(5) We have added a paragraph explaining the relationship between the neuronal embedding framework and the non-linear dynamics before Theorem 4.1, page 7. This addresses a comment by Reviewer Vu68.
(6) We have modified the statement of Theorem 4.1 by adding the condition “For any $t > 0$”, and we have added an explanatory paragraph after the theorem on page 7. This addresses a comment by Reviewer Vu68.
(7) In the paragraph above equation (25), page 10, we have remarked that proving only the consistency of the mean-field limit does not suffice to obtain guarantees on the dropout stability and connectivity of SHB solutions. This addresses a comment by Reviewer ESTq.
(8) We have discussed part (C3) of Assumption 7.1 in the paragraph above Theorem 7.2, on page 11. This addresses a comment by Reviewer ESTq.
(9) We have added a paragraph discussing our technical contribution in Section 9, page 13. This addresses a comment by Reviewer ESTq.
(10) We have changed the title of the last paragraph of Section 9 from “Generalization” to “Comparison between SGD and heavy ball methods”. We have also extended the discussion, in order to address a comment by Reviewer ESTq.
Video: https://github.com/DiyuanWu/mean-field-heavy-ball
Assigned Action Editor: ~Murat_A_Erdogdu1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 500