- TL;DR: An entropy regularized policy optimization formalism subsumes a set of sequence prediction learning algorithms. A new interpolation algorithm with improved results on text generation and game imitation learning.
- Abstract: Sequence prediction models can be learned from example sequences with a variety of training algorithms. Maximum likelihood learning is simple and efficient, yet can suffer from compounding error at test time. Reinforcement learning such as policy gradient addresses the issue but can have prohibitively poor exploration efficiency. A rich set of other algorithms, such as data noising, RAML, and softmax policy gradient, have also been developed from different perspectives. In this paper, we present a formalism of entropy regularized policy optimization, and show that the apparently distinct algorithms, including MLE, can be reformulated as special instances of the formulation. The difference between them is characterized by the reward function and two weight hyperparameters. The unifying interpretation enables us to systematically compare the algorithms side-by-side, and gain new insights into the trade-offs of the algorithm design. The new perspective also leads to an improved approach that dynamically interpolates among the family of algorithms, and learns the model in a scheduled way. Experiments on machine translation, text summarization, and game imitation learning demonstrate superiority of the proposed approach.
- Code: https://drive.google.com/file/d/13diaxzuxTSB-DReqEhkYPMmZ4BQ6vsEo/view
- Keywords: Sequence generation, sequence prediction, reinforcement learning