- Abstract: Sequence generation models are commonly refined with reinforcement learning over user-defined metrics, but high gradient variance hinders this approach in practice. To stabilize it for contextual generation of categorical sequences, we estimate the gradient by evaluating a set of correlated Monte Carlo (MC) rollouts. Because of the correlation, the number of unique rollouts is random and adapts to model uncertainty; the rollouts naturally serve as baselines for one another and are combined to effectively reduce gradient variance. We also demonstrate the use of correlated MC rollouts for binary-tree softmax models, which reduce the high generation cost in large-vocabulary scenarios by decomposing each categorical action into a sequence of binary actions. We evaluate our methods on both neural program synthesis and image captioning; they yield lower gradient variance and consistent improvements over related baselines.
- Code: https://drive.google.com/file/d/1Af53I2DG-8dQGffLKuFbO2yUGJOGTxsv/view?usp=sharing
- Keywords: binary softmax, discrete variables, policy gradient, pseudo actions, reinforcement learning, variance reduction
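The core idea of rollouts acting as baselines for each other can be illustrated with a simplified leave-one-out REINFORCE estimator for a single categorical action. This is only a minimal sketch of the mutual-baseline principle, not the paper's correlated-rollout algorithm; the function names and the toy reward are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_gradient_with_mutual_baselines(logits, reward_fn, num_rollouts=8):
    """REINFORCE-style gradient estimate for one categorical action.

    Each rollout's baseline is the mean reward of the *other* rollouts
    (leave-one-out), which keeps the estimator unbiased while reducing
    variance -- a simplified stand-in for the correlated-rollout scheme.
    """
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    actions = rng.choice(len(probs), size=num_rollouts, p=probs)
    rewards = np.array([reward_fn(a) for a in actions])

    grad = np.zeros_like(logits, dtype=float)
    for a, r in zip(actions, rewards):
        # Leave-one-out baseline: mean reward of the remaining rollouts.
        baseline = (rewards.sum() - r) / (num_rollouts - 1)
        # Score function of a softmax policy: one_hot(a) - probs.
        score = -probs.copy()
        score[a] += 1.0
        grad += (r - baseline) * score
    return grad / num_rollouts

# Toy reward favoring action 2; grad should push probability toward it.
grad = policy_gradient_with_mutual_baselines(
    np.zeros(5), lambda a: 1.0 if a == 2 else 0.0)
```

Because the baseline for each rollout excludes that rollout's own reward, the estimator remains unbiased while the shared samples cancel much of the common variance.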