Regotron: Regularizing the Tacotron2 Architecture Via Monotonic Alignment Loss

Published: 01 Jan 2022, Last Modified: 15 May 2023. SLT 2022.
Abstract: Deep learning Text-to-Speech (TTS) systems have achieved impressive generated speech quality, close to human parity. However, they suffer from training stability issues and incorrect alignment between the intermediate acoustic representation and the text input. In this work, we propose Regotron, a regularized version of Tacotron2 which alleviates these training issues by augmenting the objective function with an additional term that penalizes non-monotonic alignments in the location-sensitive attention mechanism. We demonstrate that this regularization term stabilizes the training process, produces monotonic attention faster (in 13% of the total number of epochs, compared to Tacotron2), and reduces alignment errors during inference. Moreover, Regotron incurs minimal additional computational overhead, reduces common TTS mistakes, and at the same time achieves improved speech naturalness according to subjective mean opinion scores (MOS) collected from 50 evaluators.
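The abstract describes the regularizer only at a high level: an extra loss term that penalizes non-monotonic attention alignments. A minimal sketch of one way such a penalty could be computed is shown below; the function name, the centroid-based formulation, and the hinge penalty are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def monotonic_alignment_penalty(attention_weights):
    """Hypothetical monotonicity penalty on attention alignments.

    attention_weights: array of shape (batch, decoder_steps, encoder_steps),
    each row a softmax distribution over encoder positions.

    Sketch (assumed, not the paper's formulation): compute the expected
    encoder position attended at each decoder step, then apply a hinge
    penalty whenever that position moves backwards between steps.
    """
    _, _, enc_steps = attention_weights.shape
    positions = np.arange(enc_steps)
    # Expected encoder position per decoder step: (batch, decoder_steps)
    centroids = (attention_weights * positions).sum(axis=-1)
    # Step-to-step movement; negative values are backward (non-monotonic) jumps
    deltas = centroids[:, 1:] - centroids[:, :-1]
    # Penalize only backward movement
    return np.maximum(-deltas, 0.0).mean()
```

A perfectly monotonic alignment (e.g. a diagonal attention matrix) incurs zero penalty, while a reversed alignment is penalized; in training, such a term would be added to the Tacotron2 objective with a weighting coefficient.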