Regotron: Regularizing the Tacotron2 Architecture Via Monotonic Alignment Loss

Published: 01 Jan 2022, Last Modified: 15 May 2023. SLT 2022.
Abstract: Deep learning Text-to-Speech (TTS) systems have achieved impressive generated speech quality, close to human parity. However, they suffer from training stability issues and incorrect alignment between the intermediate acoustic representation and the text input. In this work, we propose Regotron, a regularized version of Tacotron2 which alleviates these training issues by augmenting the objective function with an additional term that penalizes non-monotonic alignments in the location-sensitive attention mechanism. We demonstrate that this regularization term stabilizes the training process, produces monotonic attention faster (in 13% of the total number of epochs, compared to Tacotron2), and reduces alignment errors during inference. Moreover, Regotron incurs minimal additional computational overhead, reduces common TTS mistakes, and at the same time achieves improved speech naturalness according to subjective mean opinion scores (MOS) collected from 50 evaluators.
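The abstract describes the regularizer only at a high level: an extra loss term that penalizes non-monotonic attention alignments. A minimal sketch of one way such a penalty could be computed is shown below; the function name, the centroid-based formulation, and the hinge penalty are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def monotonic_alignment_penalty(attention_weights):
    """Hypothetical monotonicity penalty on attention alignments.

    attention_weights: array of shape (batch, decoder_steps, encoder_steps),
    each row a softmax distribution over encoder positions.

    Sketch (assumed, not the paper's formulation): compute the expected
    encoder position attended at each decoder step, then apply a hinge
    penalty whenever that position moves backwards between steps.
    """
    _, _, enc_steps = attention_weights.shape
    positions = np.arange(enc_steps)
    # Expected encoder position per decoder step: (batch, decoder_steps)
    centroids = (attention_weights * positions).sum(axis=-1)
    # Step-to-step movement; negative values are backward (non-monotonic) jumps
    deltas = centroids[:, 1:] - centroids[:, :-1]
    # Penalize only backward movement
    return np.maximum(-deltas, 0.0).mean()
```

A perfectly monotonic alignment (e.g. a diagonal attention matrix) incurs zero penalty, while a reversed alignment is penalized; in training, such a term would be added to the Tacotron2 objective with a weighting coefficient.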