Over-parameterised Shallow Neural Networks with Asymmetrical Node Scaling: Global Convergence Guarantees and Feature Learning
Abstract: We consider gradient-based optimisation of wide, shallow neural networks, where the output of each hidden node is scaled by a positive parameter. The scaling parameters are non-identical, in contrast to the classical Neural Tangent Kernel (NTK) parameterisation. We prove that, for sufficiently wide networks of this kind, gradient flow and gradient descent converge with high probability to a global minimum and, unlike in the NTK parameterisation, can learn features in a certain sense. We perform experiments illustrating our theoretical results and discuss the benefits of such scaling in terms of prunability and transfer learning.
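To make the parameterisation described in the abstract concrete, here is a minimal sketch of a shallow network whose hidden-node outputs are multiplied by fixed, non-identical positive scales, contrasted with identical NTK-style scales. The specific power-law sequence in `asymmetric_scales`, the normalisation, and all names are illustrative assumptions, not necessarily the paper's exact choices.

```python
import numpy as np

def asymmetric_scales(m, alpha=0.5):
    # Hypothetical non-identical positive node scales: a power-law
    # decaying sequence, normalised so that sum_i lambda_i^2 = 1.
    raw = np.arange(1, m + 1, dtype=float) ** (-alpha)
    return raw / np.linalg.norm(raw)

def ntk_scales(m):
    # Classical NTK parameterisation: identical scales 1/sqrt(m).
    return np.full(m, 1.0 / np.sqrt(m))

class ShallowNet:
    """Shallow network f(x) = sum_i lambda_i * a_i * relu(w_i . x)."""
    def __init__(self, d, m, scales, rng):
        self.scales = scales                    # fixed positive node scalings lambda_i
        self.W = rng.standard_normal((m, d))    # input-to-hidden weights
        self.a = rng.standard_normal(m)         # hidden-to-output weights

    def forward(self, X):
        h = np.maximum(X @ self.W.T, 0.0)       # hidden activations, shape (n, m)
        return h @ (self.scales * self.a)       # scaled sum over hidden nodes

rng = np.random.default_rng(0)
d, m = 5, 1000
X = rng.standard_normal((32, d))

ntk_net = ShallowNet(d, m, ntk_scales(m), rng)
asym_net = ShallowNet(d, m, asymmetric_scales(m), rng)
print(ntk_net.forward(X).shape, asym_net.forward(X).shape)
```

In this sketch only the trainable parameters `W` and `a` would be optimised by gradient flow or gradient descent; the scales are held fixed, which is the feature that distinguishes the asymmetrical setting from the identical-scaling NTK regime.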
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: Camera-ready version
Code: https://github.com/juho-lee/asymmetrical_scaling
Supplementary Material: pdf
Assigned Action Editor: ~Atsushi_Nitanda1
Submission Number: 3190