Over-parameterised Shallow Neural Networks with Asymmetrical Node Scaling: Global Convergence Guarantees and Feature Learning

TMLR Paper 3190 Authors

15 Aug 2024 (modified: 25 Nov 2024) · Under review for TMLR · CC BY 4.0
Abstract: We consider gradient-based optimisation of wide, shallow neural networks, where the output of each hidden node is scaled by a positive parameter. The scaling parameters are non-identical, differing from the classical Neural Tangent Kernel (NTK) parameterisation. We prove that for large such neural networks, with high probability, gradient flow and gradient descent converge to a global minimum and can learn features in some sense, unlike in the NTK parameterisation. We perform experiments illustrating our theoretical results and discuss the benefits of such scaling in terms of prunability and transfer learning.
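The parameterisation described in the abstract scales each hidden node's output by its own positive parameter. Below is a minimal NumPy sketch, under assumed details: a width-m shallow ReLU network of the form f(x) = Σ_j λ_j a_j σ(⟨w_j, x⟩) with fixed, positive, non-identical scalings λ_j. The decay schedule λ_j ∝ j^{-α}, the normalisation, and all function names are illustrative choices, not taken from the paper; they only show how such a scaling departs from the uniform 1/√m NTK scaling.

```python
import numpy as np

def init_shallow_net(m, d, rng, alpha=1.0):
    """Initialise a width-m shallow network with per-node scalings.

    Hypothetical schedule: lambda_j proportional to j^{-alpha},
    normalised so that sum_j lambda_j^2 = 1 (illustrative only).
    """
    lam = np.arange(1, m + 1, dtype=float) ** (-alpha)
    lam /= np.sqrt(np.sum(lam ** 2))      # fixed, positive, non-identical scalings
    W = rng.standard_normal((m, d))       # input weights (trained)
    a = rng.standard_normal(m)            # output weights (trained)
    return W, a, lam

def forward(x, W, a, lam):
    """f(x) = sum_j lam_j * a_j * relu(<w_j, x>)."""
    return np.dot(lam * a, np.maximum(W @ x, 0.0))

rng = np.random.default_rng(0)
W, a, lam = init_shallow_net(m=1000, d=5, rng=rng)
x = rng.standard_normal(5)
print(forward(x, W, a, lam))
```

Setting all λ_j = 1/√m would recover the usual NTK-style symmetric scaling; the point of the asymmetric choice is that the λ_j differ across nodes while the weights W and a remain the trained parameters.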
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We thank the reviewers for their detailed and useful reviews, which helped us improve our manuscript. We have made changes (in red) to address the reviewers' comments. The main changes are: 1) the addition of a proof sketch for the NTK dynamics at the end of Section 5.1, and 2) the addition of a discussion section with a summary of the contributions, limitations, alternative settings, and further extensions. Please see the individual responses to the reviewers for more specific comments.
Assigned Action Editor: ~Atsushi_Nitanda1
Submission Number: 3190