EM-Network: Learning Better Latent Variable for Sequence-to-Sequence Models

Published: 01 Feb 2023, Last Modified: 13 Feb 2023, Submitted to ICLR 2023
Keywords: Connectionist temporal classification, Speech recognition, Machine translation
TL;DR: We propose a new sequence model that learns a more promising latent variable by taking the target sequence as an additional input. It significantly advances current SOTA approaches on speech recognition and machine translation tasks.
Abstract: In a sequence-to-sequence (seq2seq) framework, the use of an unobserved latent variable, such as a latent alignment or representation, is important for addressing the mismatch between the source input and target output sequences. Existing seq2seq literature typically learns the latent space by consuming only the source input, which may produce a sub-optimal latent variable for predicting the target. Extending an expectation-maximization (EM)-like algorithm, we introduce the EM-Network, which yields a more promising latent variable by leveraging the target sequence as an additional training input. The target input serves as guidance, providing target-side context and narrowing the candidate space of the latent variable. The proposed framework is trained in a new self-distillation setup, allowing the original sequence model to benefit from the latent variable of the EM-Network. Specifically, the EM-Network's prediction serves as a soft label for training the inner sequence model, which takes only the source as input. We theoretically show that our training objective serves as a lower bound on the log-likelihood of the sequence model and is justified from the EM perspective. We conduct comprehensive experiments on two sequence learning tasks: speech recognition and machine translation. Experimental results demonstrate that the EM-Network significantly advances current state-of-the-art self-supervised learning approaches: it improves over the best prior work on speech recognition and establishes state-of-the-art performance on the WMT'14 and IWSLT'14 datasets. Moreover, the proposed method achieves considerable performance improvements even under fully supervised learning.
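The abstract describes a self-distillation setup in which the EM-Network consumes both source and target and its prediction becomes a soft label for an inner, source-only sequence model. Below is a minimal PyTorch sketch of that idea under simplifying assumptions (token-synchronous source/target lengths, plain cross-entropy instead of a CTC or seq2seq loss, target context fused by a single linear layer). All names (SimpleEncoder, EMNetwork, training_step) are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleEncoder(nn.Module):
    """Toy encoder standing in for an acoustic/text encoder."""
    def __init__(self, vocab_size, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return h  # (B, T, D)

class EMNetwork(nn.Module):
    """Outer model consumes source AND target; the inner model sees source only."""
    def __init__(self, vocab_size, d_model=256):
        super().__init__()
        self.src_encoder = SimpleEncoder(vocab_size, d_model)  # inner model's encoder
        self.tgt_encoder = SimpleEncoder(vocab_size, d_model)  # target-side guidance
        self.inner_head = nn.Linear(d_model, vocab_size)       # source-only prediction
        self.fusion = nn.Linear(2 * d_model, d_model)
        self.em_head = nn.Linear(d_model, vocab_size)          # source+target prediction

    def forward(self, src, tgt):
        src_h = self.src_encoder(src)
        inner_logits = self.inner_head(src_h)                  # inner sequence model
        tgt_h = self.tgt_encoder(tgt)
        fused = torch.tanh(self.fusion(torch.cat([src_h, tgt_h], dim=-1)))
        em_logits = self.em_head(fused)                        # EM-Network prediction
        return inner_logits, em_logits

def training_step(model, src, tgt, labels, alpha=0.5):
    inner_logits, em_logits = model(src, tgt)
    # Supervised loss for the EM-Network, which sees both source and target.
    em_loss = F.cross_entropy(em_logits.transpose(1, 2), labels)
    # Self-distillation: the EM-Network's (detached) prediction acts as a soft
    # label for the inner model, which only takes the source as input.
    soft = F.softmax(em_logits.detach(), dim=-1)
    kd_loss = F.kl_div(F.log_softmax(inner_logits, dim=-1), soft, reduction="batchmean")
    return em_loss + alpha * kd_loss

# Usage with random toy data:
model = EMNetwork(vocab_size=100)
src = torch.randint(0, 100, (4, 20))
tgt = torch.randint(0, 100, (4, 20))
loss = training_step(model, src, tgt, labels=tgt.clone())
loss.backward()
```

The weighting factor alpha and the detach on the EM-Network's logits are choices made here for illustration; the paper's actual objective is the EM-derived lower bound described in the abstract.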
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)