Parameter rollback averaged stochastic gradient descent for language model

Published: 01 Jan 2022, Last Modified: 12 May 2023. J. Comput. Methods Sci. Eng., 2022.
Abstract: Recently, AWD-LSTM (ASGD Weight-Dropped LSTM) has achieved strong results in language modeling, and many AWD-LSTM based models have obtained state-of-the-art perplexities. However, large-scale neural language models have been shown to be prone to overfitting. In the original AWD-LSTM paper, the authors adopted a retraining step, called finetuning, to obtain better results. In this paper, we present a simple yet effective parameter rollback mechanism for neural language models. We introduce parameter rollback averaged stochastic gradient descent (PR-ASGD), in which the "step" parameter of ASGD is decreased with a certain probability. Using this strategy, we achieve better word-level perplexities on Penn Treebank: 56.26 with the AWD-LSTM model and 53.57 with the AWD-LSTM-MoS (AWD-LSTM Mixture of Softmaxes) model.
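The abstract only names the mechanism, so the following is a minimal sketch of one plausible reading: the "step" in ASGD is taken to be the iteration counter that weights the running Polyak average, and "rollback" is taken to mean occasionally decreasing that counter so newer iterates receive more weight. The names `rollback_prob` and `rollback_steps` are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
# Minimal sketch of a PR-ASGD-style parameter averager (assumed interpretation).
import random
import torch

class PRASGDAverager:
    """Keeps a running average of model parameters as in ASGD, but with some
    probability rolls the averaging step counter back before each update."""

    def __init__(self, params, rollback_prob=0.1, rollback_steps=5):
        self.params = list(params)
        self.avg = [p.detach().clone() for p in self.params]  # averaged copy
        self.step_count = 0
        self.rollback_prob = rollback_prob    # hypothetical hyperparameter
        self.rollback_steps = rollback_steps  # hypothetical hyperparameter

    @torch.no_grad()
    def update(self):
        # Parameter rollback: with probability rollback_prob, decrease the
        # step counter, which enlarges the weight 1/step given to the current
        # iterate in the running average.
        if random.random() < self.rollback_prob:
            self.step_count = max(1, self.step_count - self.rollback_steps)
        self.step_count += 1
        mu = 1.0 / self.step_count
        for p, a in zip(self.params, self.avg):
            a.add_(p - a, alpha=mu)  # a <- a + mu * (p - a)

    @torch.no_grad()
    def copy_to(self, params):
        # Load the averaged weights, e.g. before evaluation.
        for p, a in zip(params, self.avg):
            p.copy_(a)
```

In use, `update()` would be called after each optimizer step of an ordinary SGD training loop, and `copy_to(model.parameters())` would swap in the averaged weights for validation, mirroring how averaged weights are used in AWD-LSTM-style training.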