Audiobook synthesis with long-form neural text-to-speech

Published: 15 Jun 2023, Last Modified: 23 Jun 2023, SSW12, Readers: Everyone
Keywords: speech synthesis, prosody, natural language processing, audiobook synthesis, long-form reading, long-context modeling
TL;DR: The goal of this work is to enhance neural TTS to be suitable for long-form content such as audiobooks.
Abstract: Despite recent advances in text-to-speech (TTS) technology, auto-narration of long-form content such as books remains a challenge. The goal of this work is to enhance neural TTS to be suitable for long-form content such as audiobooks. In addition to high quality, we aim to provide a compelling and engaging listening experience with expressivity that spans beyond a single sentence to a paragraph level so that the user can not only follow the story but also enjoy listening to it. Towards that goal, we made four enhancements to our baseline TTS system: incorporation of BERT embeddings, explicit prosody prediction from text, long-context modeling over multiple sentences, and pre-training on long-form data. We propose an evaluation framework tailored to long-form content that evaluates the synthesis on segments spanning multiple paragraphs and focuses on elements such as comprehension, ease of listening, ability to keep attention, and enjoyment. The evaluation results show that the proposed approach outperforms the baseline on all evaluated metrics, with an absolute 0.47 MOS gain in overall quality. Ablation studies further confirm the effectiveness of the proposed enhancements.
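To make the first two enhancements more concrete, the sketch below illustrates one plausible way to derive sentence-level BERT embeddings and stack them over a multi-sentence window as additional conditioning for a TTS acoustic model. This is not the paper's actual architecture; the model choice (`bert-base-uncased`), the mean-pooling strategy, and the helper names `embed_sentences` and `context_window` are illustrative assumptions.

```python
# Hypothetical sketch: paragraph-level BERT conditioning for a TTS acoustic model.
# The helpers and the window width are illustrative, not taken from the paper.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed_sentences(sentences):
    """Mean-pooled BERT embeddings, one vector per sentence."""
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state       # (B, T, 768)
    mask = enc["attention_mask"].unsqueeze(-1)       # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)      # (B, 768)

def context_window(embeddings, index, width=2):
    """Concatenate the current sentence embedding with its neighbours,
    zero-padding at paragraph boundaries (long-context conditioning)."""
    dim = embeddings.size(1)
    parts = []
    for offset in range(-width, width + 1):
        j = index + offset
        if 0 <= j < embeddings.size(0):
            parts.append(embeddings[j])
        else:
            parts.append(torch.zeros(dim))
    return torch.cat(parts)                          # (768 * (2*width + 1),)

sentences = [
    "It was a dark and stormy night.",
    "The captain stood on the deck.",
    "She watched the waves rise.",
]
embs = embed_sentences(sentences)
cond = context_window(embs, index=1)  # would be fed to the TTS decoder as extra conditioning
print(cond.shape)
```

In a real system, the concatenated context vector would be projected and injected into the acoustic model (e.g., alongside phoneme encodings), and the same embeddings could feed an explicit prosody predictor; the details of how the paper does this are described in the full text.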