Layer-wise Pre-weight Decay

24 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: deep learning, regularization, generalization, weight decay
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: The proposed layer-wise pre-weight decay exhibits remarkable robustness to the weight decay rate and significantly improves generalization performance across various conditions.
Abstract: In deep learning, weight decay is a regularization mechanism widely adopted to improve generalization performance. A common understanding of the role of weight decay is that it contributes by pushing the model weights toward 0 at each time step. However, our findings challenge this notion: we argue that the objective of weight decay is to make the weights approach the negative value of the update term rather than 0, which reveals a delay defect in certain steps that produces penalties in the opposite direction. In addition, we study a negative side effect of weight decay, showing that it damages the inter-layer connectivity of the network while reducing weight magnitude. To address these issues, we first propose real-time weight decay, which fixes the delay defect by penalizing both the weights and the gradients at each time step. We then move the decay step ahead of the update function, yielding pre-weight decay, to mitigate the performance drop caused by the side effect. Finally, to further improve general performance and enhance robustness to the decay rate, we introduce layer-wise pre-weight decay, which adjusts the decay rate based on the layer index. Extensive analytical and comparative experiments demonstrate that the proposed $\textit{layer-wise pre-weight decay}$ (LPWD) (i) exhibits remarkable robustness to the decay rate, and (ii) significantly improves generalization performance across various conditions.
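The sketch below is a minimal illustration of the two ideas named in the abstract (applying the decay step before the gradient update, and scaling the decay rate by layer index), not the authors' exact algorithm: the paper's update rule and layer-wise schedule are not given here, so the function `lpwd_sgd_step`, the linear schedule controlled by `layer_scale`, and the plain SGD base update are all assumptions made for illustration.

```python
# Hypothetical sketch of "layer-wise pre-weight decay" on top of plain SGD.
# The decay ordering and the layer-wise schedule are assumptions, not the
# paper's exact formulation.
import torch

def lpwd_sgd_step(layers, lr=0.1, base_decay=1e-4, layer_scale=0.1):
    """One assumed SGD step with layer-wise pre-weight decay.

    - "pre-weight decay": the decay is applied to the weights *before*
      the gradient update, rather than folded into or after it.
    - "layer-wise": the decay rate depends on the layer index
      (a simple linear schedule is assumed here).
    """
    with torch.no_grad():
        for idx, layer in enumerate(layers):
            decay = base_decay * (1.0 + layer_scale * idx)  # assumed schedule
            for p in layer.parameters():
                if p.grad is None:
                    continue
                p.mul_(1.0 - lr * decay)   # decay step applied first ("pre")
                p.add_(p.grad, alpha=-lr)  # then the usual gradient update
```

As a usage note under the same assumptions, `layers` would be an ordered list of `torch.nn.Module` blocks (e.g., the stages of a CNN), and the step would be called after `loss.backward()` in place of `optimizer.step()`.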
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9078