Optimal Completion Distillation for Sequence Learning

Sara Sabour; William Chan; Mohammad Norouzi

Optimal Completion Distillation for Sequence Learning

Sara Sabour, William Chan, Mohammad Norouzi

Published: 21 Dec 2018, Last Modified: 22 Jun 2025ICLR 2019 Conference Blind SubmissionReaders: Everyone

Abstract: We present Optimal Completion Distillation (OCD), a training procedure for optimizing sequence to sequence models based on edit distance. OCD is efficient, has no hyper-parameters of its own, and does not require pre-training or joint optimization with conditional log-likelihood. Given a partial sequence generated by the model, we first identify the set of optimal suffixes that minimize the total edit distance, using an efficient dynamic programming algorithm. Then, for each position of the generated sequence, we use a target distribution which puts equal probability on the first token of all the optimal suffixes. OCD achieves the state-of-the-art performance on end-to-end speech recognition, on both Wall Street Journal and Librispeech datasets, achieving $9.3\%$ WER and $4.5\%$ WER, respectively.

Keywords: Sequence Learning, Edit Distance, Speech Recognition, Deep Reinforcement Learning

TL;DR: Optimal Completion Distillation (OCD) is a training procedure for optimizing sequence to sequence models based on edit distance which achieves state-of-the-art on end-to-end Speech Recognition tasks.

Code: [![Papers with Code](/images/pwc_icon.svg) 2 community implementations](https://paperswithcode.com/paper/?openreview=rkMW1hRqKX)

Data: [LibriSpeech](https://paperswithcode.com/dataset/librispeech)

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 2 code implementations](https://www.catalyzex.com/paper/optimal-completion-distillation-for-sequence/code)

14 Replies

Loading