Over-parameterised Shallow Neural Networks with Asymmetrical Node Scaling: Global Convergence Guarantees and Feature Learning
Abstract: We consider gradient-based optimisation of wide, shallow neural networks, where the output of each hidden node is scaled by a positive parameter. The scaling parameters are non-identical, in contrast to the classical Neural Tangent Kernel (NTK) parameterisation. We prove that, for sufficiently wide networks of this kind, gradient flow and gradient descent converge with high probability to a global minimum and, unlike in the NTK parameterisation, can learn features in a certain sense. We perform experiments illustrating our theoretical results and discuss the benefits of such scaling in terms of prunability and transfer learning.
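To make the parameterisation described in the abstract concrete, here is a minimal sketch of a shallow network whose hidden-node outputs are multiplied by fixed, non-identical positive scales, contrasted with identical NTK-style scales. The specific power-law sequence in `asymmetric_scales`, the normalisation, and all names are illustrative assumptions, not necessarily the paper's exact choices.

```python
import numpy as np

def asymmetric_scales(m, alpha=0.5):
    # Hypothetical non-identical positive node scales: a power-law
    # decaying sequence, normalised so that sum_i lambda_i^2 = 1.
    raw = np.arange(1, m + 1, dtype=float) ** (-alpha)
    return raw / np.linalg.norm(raw)

def ntk_scales(m):
    # Classical NTK parameterisation: identical scales 1/sqrt(m).
    return np.full(m, 1.0 / np.sqrt(m))

class ShallowNet:
    """Shallow network f(x) = sum_i lambda_i * a_i * relu(w_i . x)."""
    def __init__(self, d, m, scales, rng):
        self.scales = scales                    # fixed positive node scalings lambda_i
        self.W = rng.standard_normal((m, d))    # input-to-hidden weights
        self.a = rng.standard_normal(m)         # hidden-to-output weights

    def forward(self, X):
        h = np.maximum(X @ self.W.T, 0.0)       # hidden activations, shape (n, m)
        return h @ (self.scales * self.a)       # scaled sum over hidden nodes

rng = np.random.default_rng(0)
d, m = 5, 1000
X = rng.standard_normal((32, d))

ntk_net = ShallowNet(d, m, ntk_scales(m), rng)
asym_net = ShallowNet(d, m, asymmetric_scales(m), rng)
print(ntk_net.forward(X).shape, asym_net.forward(X).shape)
```

In this sketch only the trainable parameters `W` and `a` would be optimised by gradient flow or gradient descent; the scales are held fixed, which is the feature that distinguishes the asymmetrical setting from the identical-scaling NTK regime.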
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: Camera-ready version
Code: https://github.com/juho-lee/asymmetrical_scaling
Supplementary Material: pdf
Assigned Action Editor: ~Atsushi_Nitanda1
Submission Number: 3190