Keywords: transformer training dynamics, spectral analysis, power-law spectrum, model optimization, neural scaling law
Abstract: Transformers learn high-dimensional representations, making it challenging to interpret how training shapes them. In this work, we show that hidden activations exhibit a consistent heavy-tailed power-law spectral decay with exponents $\alpha < 1$ that gradually increase across layers and during fine-tuning. This spectral evolution offers a compact signature of training dynamics, with larger $\alpha$ values empirically correlating with better generalization. Complementing this, we find that the exponents of the gradient SVD spectrum decrease with depth, suggesting that gradients become increasingly isotropic as they backpropagate. Together, these spectral signals offer an alternative lens for examining the hidden structure of transformers, potentially inspiring new ways to optimize pre-training and push the scaling-law frontier inward.
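To make the spectral measurement concrete, here is a minimal sketch (not the authors' code) of how one might estimate the power-law decay exponent $\alpha$ from a layer's activation matrix: take the singular values and fit $\sigma_i \propto i^{-\alpha}$ by least squares in log-log space. The function name, the `tail_frac` cutoff, and the synthetic test matrix are illustrative assumptions, not details from the paper.

```python
import numpy as np

def spectral_decay_exponent(hidden: np.ndarray, tail_frac: float = 0.9) -> float:
    """Fit sigma_i ∝ i^{-alpha} to the singular values of an activation matrix.

    `hidden` is assumed to be a (tokens x d_model) activation matrix, e.g.
    captured with a forward hook; returns the fitted exponent alpha.
    """
    sigma = np.linalg.svd(hidden, compute_uv=False)   # singular values, largest first
    sigma = sigma[sigma > 0]
    n = max(2, int(tail_frac * len(sigma)))           # drop the extreme tail (numerical noise)
    ranks = np.arange(1, n + 1)
    # Least-squares slope of log(sigma) vs. log(rank); alpha is the negated slope.
    slope, _ = np.polyfit(np.log(ranks), np.log(sigma[:n]), 1)
    return -slope

# Sanity check on a synthetic matrix with a prescribed spectrum sigma_i = i^{-0.8}.
rng = np.random.default_rng(0)
ranks = np.arange(1, 257)
U, _ = np.linalg.qr(rng.normal(size=(1024, 256)))
V, _ = np.linalg.qr(rng.normal(size=(256, 256)))
X = (U * ranks.astype(float) ** -0.8) @ V.T
print(round(spectral_decay_exponent(X), 2))          # ≈ 0.8
```

The same fit could in principle be applied to per-layer gradient matrices to examine the depth-wise trend in their SVD spectra that the abstract describes.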
Submission Number: 43