Controllable generation in an unsupervised fashion: Manifold control of latent features using contrastive alignment and self-organizing rewards
- Keywords: controllable sequence generation, reinforcement learning, contrastive learning, self-organizing map
- Abstract: Controllable sequence generation is the task of generating an intended sequence, and research in this area focuses on developing joint learning frameworks that combine supervised learning (i.e., optimizing an MLE objective) with reinforcement learning (i.e., optimizing a reward function). Previous works achieved joint learning by pre-training a sequential generative model with an MLE objective and then tuning the model within an RL framework. This two-stage approach (MLE-based pre-training followed by reward-based tuning) has shown great potential and is widely adopted in fields such as chatbots, context-aware recommendation, drug discovery, and diet planning. However, the performance of the two-stage approach depends heavily on the state of the pre-trained model, and the model tends to forget what it learned during pre-training because the learned representations are overwritten during the tuning step. This stems from the difficulty of training supervised learning (SL) and reinforcement learning (RL) simultaneously, since the former is offline learning while the latter is online learning. To overcome this challenge and bridge offline and online learning, we propose a novel joint learning framework that combines SL and RL in an end-to-end fashion based on unsupervised learning. The proposed model compares every pair of observations with respect to their rewards; it is then optimized so that the output sequence moves closer to its target sequence in the latent space when the reward of the target sequence is higher than that of the output sequence, and farther away otherwise. The results show that the proposed model succeeds in controlling the generative process of sequence generation with scalable complexity.
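The reward-guided alignment described in the abstract can be illustrated with a minimal sketch. The function name, margin parameter, and squared-distance form below are illustrative assumptions, not the paper's actual objective: given latent embeddings of an output sequence and a target sequence, the loss attracts the pair when the target's reward is higher and repels it (up to a margin) otherwise, in the style of a standard contrastive loss.

```python
import numpy as np

def reward_contrastive_loss(z_out, z_tgt, r_out, r_tgt, margin=1.0):
    """Hypothetical reward-guided contrastive objective (illustrative only).

    z_out, z_tgt: latent embeddings of the output and target sequences.
    r_out, r_tgt: scalar rewards of the two sequences.
    If the target's reward is higher, penalize distance (attraction);
    otherwise, penalize being closer than `margin` (repulsion).
    """
    d = np.linalg.norm(z_out - z_tgt)  # Euclidean distance in latent space
    if r_tgt > r_out:
        # Target is better: pull the output embedding toward it.
        return d ** 2
    # Target is worse: push the output embedding away, up to the margin.
    return max(0.0, margin - d) ** 2

# Attraction case: target reward is higher, so distance is penalized.
loss_pull = reward_contrastive_loss(
    np.array([0.0, 0.0]), np.array([3.0, 4.0]), r_out=0.1, r_tgt=0.9)

# Repulsion case: the pair is already farther apart than the margin,
# so no penalty is incurred.
loss_push = reward_contrastive_loss(
    np.array([0.0, 0.0]), np.array([3.0, 4.0]), r_out=0.9, r_tgt=0.1)
```

In practice such a pairwise term would be summed over all observation pairs in a batch and optimized jointly with the MLE objective, which is one plausible way to realize the end-to-end SL+RL combination the abstract describes.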