Keywords: Parameter dynamics, theoretical deep learning
TL;DR: We present a general framework for modeling the balancedness between consecutive layers and the layer-wise evolution of parameter norms, unifying several architecture-specific results from the literature.
Abstract: Understanding parameter dynamics under gradient-based training is central to explaining implicit regularization and generalization in deep learning, and the balancedness of layers, defined as the difference between the left Gramian of a layer and the right Gramian of the next layer, plays a key role in many existing analyses. We present a unified and substantially more general framework for studying layer balancedness and parameter-norm dynamics across a broad class of neural architectures. Modeling networks as compositions of learnable Hilbert-Schmidt operators interleaved with fixed positive-homogeneous nonlinearities, we show that consecutive layers with no nonlinearity between them converge exponentially fast toward a balanced state under weight decay. Furthermore, we derive a general expression for the time evolution of the squared norm of each learnable layer, showing that parameter-norm dynamics reduce to a single scalar quantity: the inner product between the network output and the negative gradient of the loss with respect to it. Our framework recovers existing results as special cases while extending them to architectures beyond the reach of prior, architecture-specific analyses. Finally, it connects parameter evolution to function-space dynamics, which can be studied using, for example, NTK theory and mean-field analysis.
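A minimal worked sketch of the two headline claims, in the simplest setting the abstract covers: two consecutive linear layers trained by gradient flow with weight decay. The symbols $W_1$, $W_2$, $B$, $G$, and $\lambda$ are illustrative notation, not taken from the paper. For $f = W_2 W_1$ with layer updates $\dot{W}_i = -\nabla_{W_i} L - \lambda W_i$ and $G := \nabla_f L$, the chain rule gives $\nabla_{W_1} L = W_2^\top G$ and $\nabla_{W_2} L = G W_1^\top$, so the gradient cross terms cancel in the balancedness $B := W_1 W_1^\top - W_2^\top W_2$ (left Gramian of the first layer minus right Gramian of the next):

$$\dot{B} = -2\lambda B \quad\Longrightarrow\quad B(t) = e^{-2\lambda t} B(0),$$

i.e. exponential convergence to the balanced state $B = 0$ at rate $2\lambda$. Likewise, since $f$ is degree-$1$ homogeneous in each layer's parameters, Euler's identity gives $\langle W_i, \nabla_{W_i} L \rangle = \langle f, \nabla_f L \rangle$, hence

$$\frac{d}{dt}\,\tfrac{1}{2}\lVert W_i\rVert_F^2 = \langle f, -\nabla_f L\rangle - \lambda \lVert W_i\rVert_F^2,$$

so a single scalar, the inner product of the network output with the negative loss gradient, drives the norm of every layer.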
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 118