TL;DR: We introduce the first long-form spoken language model (16 min. of audio at once), discuss key design choices (e.g. state-space modeling), and propose new benchmarks.
Abstract: We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, textless spoken language models struggle to generate plausible speech past tens of seconds: the high temporal resolution of speech tokens causes loss of coherence, architectures face issues with long-sequence training or extrapolation, and memory costs grow at inference time. From these considerations we derive **SpeechSSM**, the first speech language model family to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates. SpeechSSMs leverage recent advances in linear-time sequence modeling to greatly surpass current Transformer spoken LMs in coherence and efficiency on multi-minute generations while still matching them at the utterance level.
As we found current spoken language evaluations uninformative, especially in this new long-form setting, we also introduce: **LibriSpeech-Long**, a benchmark for long-form speech evaluation; new embedding-based and LLM-judged metrics; and quality measurements over length and time. Speech samples, the LibriSpeech-Long dataset, and any future code or model releases can be found at https://google.github.io/tacotron/publications/speechssm/.
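To make the "linear-time sequence modeling" claim concrete, the sketch below shows the generic diagonal linear state-space recurrence that such models build on. This is illustrative only, not the SpeechSSM implementation: all names, shapes, and parameter values are assumptions chosen for the toy example. The point is that each step updates a fixed-size state, so per-step compute and memory stay constant however long the sequence runs.

```python
import numpy as np

def ssm_scan(u, A_diag, B, C):
    """Run a diagonal linear state-space recurrence over an input sequence.

    x[t] = A_diag * x[t-1] + B @ u[t]   (elementwise A for a diagonal SSM)
    y[t] = C @ x[t]

    Cost is linear in sequence length with O(state_dim) memory, which is
    what makes multi-minute autoregressive decoding tractable.
    """
    x = np.zeros(A_diag.shape[0])
    ys = []
    for u_t in u:                   # one step per token/frame
        x = A_diag * x + B @ u_t    # constant-time, constant-memory state update
        ys.append(C @ x)            # readout at this step
    return np.stack(ys)

# Toy usage: 1000-step input, 4-dim features, 16-dim state, 8-dim output.
rng = np.random.default_rng(0)
A_diag = np.full(16, 0.99)          # stable decay per state channel
B = rng.normal(size=(16, 4)) * 0.1
C = rng.normal(size=(8, 16)) * 0.1
u = rng.normal(size=(1000, 4))
y = ssm_scan(u, A_diag, B, C)       # shape (1000, 8)
```

Because the state is fixed-size, a sampler built on such a recurrence avoids the growing key-value cache of Transformer attention, which is the efficiency contrast the abstract draws.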
Lay Summary: Current AI models find it difficult to operate "textlessly", i.e., purely in speech. Though they can produce natural-sounding speech, they often become incoherent or repetitive the longer they talk. This limitation hinders the development of realistic voice assistants and engaging multimedia content, where longer stretches of speech are common. To address this challenge, we introduce SpeechSSM, a new type of speech model capable of generating coherent speech lasting several minutes (e.g., 16 minutes of read or extemporaneous speech) without any text-based stages during generation. SpeechSSM leverages recent improvements in efficient linear-time sequence modeling, enabling it to maintain context and continuity even in lengthy speech generation. We also propose new methods to measure how realistic this extended speech sounds, using LLM-judged evaluations and embedding-based metrics that track speech quality over time. Additionally, we provide a new benchmark called LibriSpeech-Long, specifically designed for evaluating long-form speech generation. Our work enables new speech generation applications, enhancing various long-form media such as audiobooks, podcasts, voice agent sessions, and video-related content.
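As an illustration of the kind of embedding-based, quality-over-time measurement described above, here is a hypothetical sketch that scores how far a generated transcript drifts from its prompt as the generation proceeds. It is not the paper's metric: the `embed` function stands in for any off-the-shelf text-embedding model, and the window and stride values are arbitrary choices for the example.

```python
import numpy as np

def windowed_coherence(transcript_words, embed, window=50, stride=25):
    """Hypothetical coherence-over-time metric.

    Embeds sliding windows of an ASR transcript and measures the cosine
    similarity of each window to the first (prompt-adjacent) window.
    Scores that decay toward 0 suggest the generation drifts off-topic
    as it gets longer.

    `embed` is an assumed dependency: any function mapping a string to a
    1-D numpy vector (e.g., a sentence-embedding model).
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    windows = [" ".join(transcript_words[i:i + window])
               for i in range(0, max(1, len(transcript_words) - window + 1), stride)]
    vecs = [embed(w) for w in windows]
    return [cos(vecs[0], v) for v in vecs]
```

Plotting these per-window scores against time gives the "quality over length and time" view the abstract describes, rather than a single utterance-level number.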
Link To Code: https://google.github.io/tacotron/publications/speechssm/
Primary Area: Applications->Language, Speech and Dialog
Keywords: spoken language models, long-form generation, state-space models, evaluation
Submission Number: 4403