Abstract: Modeling prosody in Text-to-Speech (TTS) is challenging due to ambiguous orthography and the high cost of annotating prosodic events. This study focuses on modeling contrastive focus, the emphasis of a word to contrast it with presuppositions held by an interlocutor. In TTS, contrastive focus can be modeled with binary, symbolic word-level inputs in a supervised setting. To address the absence of annotated data, we propose the Invert-Classify method, which leverages a frozen TTS model and unlabeled parallel text-speech data to recover missing contrastive focus inputs. Our approach achieves a binary F-score of up to 0.71 for contrastive focus annotation recovery while using only 5–10% of the annotated training data. Furthermore, subjective listening tests show that training on additional data labeled via Invert-Classify improves overall synthesis quality while also providing good control and plausible-sounding contrastive focus.
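To make the Invert-Classify idea concrete, below is a minimal sketch of one way the inversion step could be realized: gradient-based optimization of relaxed focus inputs through a frozen, differentiable TTS model, followed by thresholding into binary labels. The `tts(text_tokens, focus)` interface, function names, and hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def invert_classify(tts, text_tokens, target_mel, n_words,
                    steps=200, lr=0.1, thresh=0.5):
    """Recover word-level contrastive-focus labels by inverting a frozen
    TTS model against a reference utterance (sketch, assumed interface).

    tts: frozen model mapping (text_tokens, focus) -> predicted mel-spectrogram.
    Returns a binary focus vector of shape (n_words,).
    """
    # Freeze the synthesis model; only the focus inputs are optimized.
    for p in tts.parameters():
        p.requires_grad_(False)

    # Continuous relaxation of the binary focus inputs: one logit per word.
    focus_logits = torch.zeros(n_words, requires_grad=True)
    opt = torch.optim.Adam([focus_logits], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        focus = torch.sigmoid(focus_logits)       # soft labels in (0, 1)
        pred_mel = tts(text_tokens, focus)        # assumed model signature
        loss = torch.nn.functional.l1_loss(pred_mel, target_mel)
        loss.backward()                           # gradients flow only to the inputs
        opt.step()

    # Classify: threshold the optimized soft labels into binary focus marks.
    return (torch.sigmoid(focus_logits) > thresh).long()
```

Labels recovered this way for unlabeled parallel text-speech pairs could then be added to the supervised training set, which is the role Invert-Classify plays in the experiments summarized above.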