SpeakStream: Streaming Text-to-Speech with Interleaved Data

OpenReview Anonymous Preprint Submission 663 Authors

06 Mar 2025 | Anonymous Preprint Submission | Everyone | CC BY-NC-ND 4.0

Keywords: TTS, Streaming, Online

TL;DR: A dual-streaming TTS system with very low latency

Abstract: Speech front-ends are increasingly integrated with large language models (LLMs) in end-to-end models, but cascaded models that stream LLM outputs to text-to-speech (TTS) systems remain surprisingly under-explored despite their simplicity. Using traditional TTS to convert LLM outputs to audio, however, poses a technical problem: the entire utterance is needed to generate stylistic audio. In this paper, we present SpeakStream, a streaming TTS system that generates audio incrementally from streaming text using a decoder-only architecture. The model is trained with a next-step prediction loss on force-aligned, interleaved text-speech data. During inference, SpeakStream generates speech incrementally while absorbing streaming text, making it suitable for cascaded conversational AI agents in which an LLM streams text to a TTS system. Our experiments show that SpeakStream matches batch TTS quality while enabling streaming capabilities.
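
The abstract does not spell out how the interleaved training sequences are built, so the following is only a minimal sketch of one plausible construction, not the authors' implementation. It assumes word-level forced alignments with timestamps in seconds, discrete speech tokens at a fixed frame rate, and a fixed number of words per text chunk; the names `AlignedWord`, `interleave`, `frame_rate`, and `words_per_chunk` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# One item of the interleaved sequence: ("text", word) or ("speech", token_id).
Item = Tuple[str, Union[str, int]]


@dataclass
class AlignedWord:
    """A word with forced-alignment timestamps in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


def interleave(
    words: List[AlignedWord],
    speech_tokens: List[int],
    frame_rate: float,
    words_per_chunk: int = 1,
) -> List[Item]:
    """Interleave text chunks with the speech tokens aligned to them.

    Assumption: each text chunk is followed by all speech tokens from the
    end of the previous chunk up to the end of this chunk, so silences
    between words are kept and every speech token appears exactly once.
    """
    sequence: List[Item] = []
    prev_frame = 0
    for i in range(0, len(words), words_per_chunk):
        chunk = words[i : i + words_per_chunk]
        sequence.extend(("text", w.text) for w in chunk)

        is_last = i + words_per_chunk >= len(words)
        if is_last:
            # The final chunk absorbs any trailing audio.
            end_frame = len(speech_tokens)
        else:
            end_frame = min(round(chunk[-1].end * frame_rate), len(speech_tokens))
        end_frame = max(end_frame, prev_frame)  # guard against overlapping alignments

        sequence.extend(("speech", t) for t in speech_tokens[prev_frame:end_frame])
        prev_frame = end_frame
    return sequence
```

A decoder-only model trained with next-step prediction on sequences of this shape would see each text chunk before the speech it conditions, which is what would allow inference to alternate between reading newly arrived text and emitting speech tokens. The chunk size and the exact alignment granularity are design choices the abstract leaves open.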

Submission Number: 663