Abstract: Momentum is a widely adopted technique in deep neural network (DNN) optimization, recognized for enhancing performance. However, our analysis indicates that momentum is not always beneficial to the network. We theoretically demonstrate that increasing the orthogonality of parameter vectors significantly improves the generalization ability of several common types of DNNs, including multilayer perceptrons (MLPs), convolutional neural networks (CNNs), and Transformers, while momentum tends to reduce this orthogonality. Our results further show that integrating normalization and residual connections into these common DNNs helps preserve orthogonality, thereby enhancing the generalization of networks optimized with momentum. Extensive experiments across MLPs, CNNs, and Transformers validate our theoretical findings. Finally, we find that the parameter vectors of common pre-trained language models (PLMs) consistently maintain strong orthogonality.