Controllable Emphasis with zero data for text-to-speech

Published: 15 Jun 2023, Last Modified: 30 Jun 2023, SSW12, Readers: Everyone
Keywords: text-to-speech, emphasis control
Abstract: We present a scalable method to produce high-quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective emphasis method consists of increasing the predicted duration of the emphasised word. We show that this is significantly better than signal-processing-based techniques, improving naturalness by $7.3\%$ and identifiability by $40\%$ on a reference female en-US voice, and significantly closing the gap to methods that require explicit recordings. The method proves to be effective in 4 languages (English, Spanish, Italian, German) for different voices and multiple speaking styles.
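
The duration-scaling idea in the abstract can be illustrated with a minimal sketch. It assumes the duration model outputs one duration per phoneme and that each phoneme is tagged with the index of the word it belongs to. The function name `emphasise_durations`, the `word_ids` alignment, and the scaling `factor` value are hypothetical and chosen for illustration only; the abstract does not state the factor used by the authors.

```python
from typing import List, Sequence


def emphasise_durations(
    durations: Sequence[float],
    word_ids: Sequence[int],
    emphasised_word: int,
    factor: float = 1.25,  # illustrative value, not taken from the paper
) -> List[float]:
    """Lengthen the predicted durations of the phonemes in one word.

    durations:       predicted duration of each phoneme (e.g. in ms or frames)
    word_ids:        index of the word each phoneme belongs to (assumed alignment)
    emphasised_word: index of the word to emphasise
    factor:          multiplicative lengthening applied to that word's phonemes
    """
    if len(durations) != len(word_ids):
        raise ValueError("durations and word_ids must have the same length")
    if factor <= 0:
        raise ValueError("factor must be positive")
    # Scale only the phonemes of the emphasised word; leave the rest untouched.
    return [
        d * factor if w == emphasised_word else d
        for d, w in zip(durations, word_ids)
    ]


# Two words of two phonemes each; emphasise the second word (index 1).
predicted = [50.0, 80.0, 60.0, 90.0]
alignment = [0, 0, 1, 1]
modified = emphasise_durations(predicted, alignment, emphasised_word=1, factor=1.5)
print(modified)
```

In this sketch the modified durations would then be passed to the rest of the TTS pipeline in place of the original predictions, so no emphasis recordings or annotations are involved at any point.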