Parallel Scheduled Sampling

Daniel Duckworth; Arvind Neelakantan; Ben Goodrich; Lukasz Kaiser; Samy Bengio

25 Sept 2019 (modified: 05 May 2023), ICLR 2020 Conference Blind Submission, Readers: Everyone

Keywords: deep learning, generative models, teacher forcing, scheduled sampling

TL;DR: We describe a simple technique to parallelize Scheduled Sampling across time, which gives better sample quality and trains almost as fast as teacher forcing.

Abstract: Auto-regressive models are widely used in sequence generation problems. The output sequence is typically generated in a predetermined order, one discrete unit (pixel, word, or character) at a time. The models are trained by teacher forcing, where the ground-truth history is fed to the model as input; at test time, this history is replaced by the model's own predictions. Scheduled Sampling (Bengio et al., 2015) aims to mitigate this discrepancy between train and test time by randomly replacing some discrete units in the history with the model's prediction. While teacher-forced training works well with ML accelerators because the computation can be parallelized across time, Scheduled Sampling involves undesirable sequential processing. In this paper, we introduce a simple technique to parallelize Scheduled Sampling across time. Experimentally, we find that the proposed technique leads to equivalent or better performance on image generation, summarization, dialog generation, and translation compared to teacher-forced training. In the dialog response generation task, Parallel Scheduled Sampling achieves a 1.6 BLEU score (11.5%) improvement over teacher forcing, while in image generation it achieves 20% and 13.8% improvements in Fréchet Inception Distance (FID) and Inception Score (IS), respectively. Further, we discuss the effects of different hyper-parameters associated with Scheduled Sampling on model performance.
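The replacement step that the abstract attributes to Scheduled Sampling (randomly swapping units of the ground-truth history for the model's predictions) can be illustrated with a minimal sketch. The function name `mix_history`, the parameter `replace_prob`, and the toy token lists are hypothetical and chosen only for illustration. The sketch assumes the model's predictions for every position are already available; how the paper obtains them in parallel is not described in the abstract and is not shown here.

```python
import random

def mix_history(gold, predicted, replace_prob, rng=None):
    """Return a copy of `gold` in which each unit is independently replaced,
    with probability `replace_prob`, by the prediction at the same position."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if rng is None:
        rng = random.Random()
    # Each position is decided independently, so this step itself has no
    # dependency between time steps.
    return [p if rng.random() < replace_prob else g
            for g, p in zip(gold, predicted)]

# Toy example with made-up tokens (not data from the paper).
gold = ["the", "cat", "sat", "down"]
predicted = ["a", "dog", "sat", "up"]
mixed = mix_history(gold, predicted, replace_prob=0.5, rng=random.Random(0))
```

With `replace_prob=0.0` the mixed history equals the ground truth, which corresponds to plain teacher forcing; with `replace_prob=1.0` the history consists entirely of model predictions.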
