StreamSpeech: Low-Latency Neural Architecture for High-Quality On-Device Speech Synthesis

Published: 01 Jan 2023 · Last Modified: 11 Jun 2024 · ICASSP 2023 · CC BY-SA 4.0
Abstract: Neural text-to-speech (TTS) systems have recently demonstrated the ability to synthesize high-quality, natural speech. However, the inference latency and real-time factor (RTF) of such systems are still too high for deployment on devices without specialized hardware. In this paper, we describe StreamSpeech, an optimized architecture for a complete TTS system that produces high-quality speech and runs faster than real time, with imperceptible latency, on resource-constrained devices using a single CPU core. We divide the standard TTS processing pipeline into three phases according to their operating resolution and optimize each separately. Our main novel contribution is a lightweight convolutional acoustic-model decoder, which enables streaming, low-latency speech generation. Experiments show that the resulting complete TTS system achieves 79 ms latency and 0.155 RTF on a low-power notebook x86 CPU, and 276 ms latency and 0.289 RTF on a mid-range mobile ARM CPU, with no noticeable difference in the quality of the generated speech.
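For context, the real-time factor quoted in the abstract is the standard ratio of wall-clock synthesis time to the duration of the generated audio; values below 1.0 mean the system synthesizes faster than real time. A minimal sketch of the metric (the function name is illustrative, not from the paper):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of generated audio.
    RTF < 1.0 means the system runs faster than real time."""
    return synthesis_seconds / audio_seconds

# Example consistent with the reported x86 figure: if synthesizing
# 10 s of audio takes 1.55 s of wall-clock time, RTF is 0.155.
print(f"{real_time_factor(1.55, 10.0):.3f}")  # 0.155
```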
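The abstract does not detail the decoder's internals, but streaming convolutional generation generally rests on causal convolutions that keep a small left-context buffer between chunks, so output frames can be emitted as input arrives rather than after the full utterance. The sketch below is a generic illustration of that mechanism under those assumptions, not StreamSpeech's actual decoder; all names are hypothetical.

```python
import numpy as np

class StreamingCausalConv1d:
    """Illustrative causal 1-D convolution with a persistent left-context
    buffer: the building block that lets a convolutional decoder emit
    output chunk by chunk. Generic sketch, not the paper's architecture."""

    def __init__(self, in_ch: int, out_ch: int, kernel: int):
        self.kernel = kernel
        rng = np.random.default_rng(0)
        self.weight = rng.standard_normal((out_ch, in_ch, kernel)) * 0.1
        # Left context of kernel-1 past frames, initialized to silence.
        self.buffer = np.zeros((in_ch, kernel - 1))

    def process_chunk(self, x: np.ndarray) -> np.ndarray:
        """x: (in_ch, T) chunk of input frames; returns (out_ch, T)."""
        padded = np.concatenate([self.buffer, x], axis=1)
        # Carry the last kernel-1 frames forward to the next chunk.
        self.buffer = padded[:, -(self.kernel - 1):]
        T = x.shape[1]
        out = np.empty((self.weight.shape[0], T))
        for t in range(T):
            window = padded[:, t:t + self.kernel]  # (in_ch, kernel)
            out[:, t] = np.tensordot(self.weight, window,
                                     axes=([1, 2], [0, 1]))
        return out

# Streaming equals offline: feeding the same frames in one call or in
# two chunks yields identical output, which is what makes chunked,
# low-latency generation possible.
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))
offline, streaming = StreamingCausalConv1d(4, 2, 3), StreamingCausalConv1d(4, 2, 3)
full = offline.process_chunk(x)
chunked = np.concatenate([streaming.process_chunk(x[:, :5]),
                          streaming.process_chunk(x[:, 5:])], axis=1)
assert np.allclose(full, chunked)
```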