## Data-dependent Gaussian Prior Objective for Language Generation

Sep 25, 2019 ICLR 2020 Conference Blind Submission readers: everyone Show Bibtex
• TL;DR: We introduce an extra data-dependent Gaussian prior objective to augment the current MLE training, which is designed to capture the prior knowledge in the ground-truth data.
• Abstract: For typical sequence prediction problems like language generation, maximum likelihood estimation (MLE) has been commonly adopted as it encourages the predicted sequence most consistent with the ground-truth sequence to have the highest probability of occurring. However, MLE focuses on a once-for-all matching between the predicted sequence and gold-standard consequently, treating all incorrect predictions as being equally incorrect. We call such a drawback {\it negative diversity ignorance} in this paper. Treating all incorrect predictions as equal unfairly downplays the nuance of these sequences' detailed token-wise structure. To counteract this, we augment the MLE loss by introducing an extra KL divergence term which is derived from comparing a data-dependent Gaussian prior and the detailed training prediction. The proposed data-dependent Gaussian prior objective (D2GPo) is defined over a prior topological order of tokens, poles apart from the data-independent Gaussian prior (L2 regularization) commonly adopted for smoothing the training of MLE. Experimental results show that the proposed method can effectively make use of more detailed prior in the data and significantly improve the performance of typical language generation tasks, including supervised and unsupervised machine translation, text summarization, storytelling, and image caption.