\begin{figure}[h]
    \centering
    \includegraphics[width=0.95\linewidth]{figures/head-figure-2.pdf}
    \caption{(Left) The performance on Spoken QA continuously improves as the amount of synthetic interleaved data increases, significantly surpassing the previous SOTA (Moshi). (Right) The pipeline for synthesizing interleaved speech-text data.}
    \label{fig:head-figure}
\end{figure}

\section{Introduction}

Large language models (LLMs) have significantly advanced natural language processing, demonstrating capabilities beyond traditional language tasks. Trained on vast internet corpora, they exhibit emergent abilities such as instruction following~\citep{ouyang2022training}, logical reasoning~\citep{wei2022chain}, and tool utilization~\citep{schick2023toolformer}. These advancements have enabled applications like interactive chatbots and personalized digital assistants.
However, an ideal AI assistant should not rely solely on text. Voice offers a more natural and intuitive interface for human-AI interaction. Traditional voice-based systems chain Automatic Speech Recognition (ASR), LLMs, and Text-to-Speech (TTS) models in a cascade. This design loses information at the ASR and TTS stages, which limits its ability to capture and express the rich nuances of speech.

Speech language models (SpeechLMs) have emerged as a promising approach for building general-purpose voice assistants capable of processing speech input and output end-to-end. Several methods have been explored to construct SpeechLMs. \citet{GSLM} proposed unsupervised learning on speech corpora using discrete semantic tokens. \citet{TWIST} improved performance by initializing from pre-trained language models, while Moshi~\citep{moshi} utilized large-scale training on private speech data. However, a key challenge remains: the scarcity of speech data compared to text data. While text corpora like FineWeb~\citep{fineweb} offer 15 trillion high-quality tokens, large unsupervised speech datasets like VoxPopuli~\citep{voxpopuli} provide only 400K hours of speech, equivalent to 36 billion tokens at 25Hz. This disparity limits the scalability and performance of SpeechLMs relative to LLMs.
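
As a rough check of the estimate above, the speech token count follows directly from the corpus duration and the token rate. The ratio to FineWeb is only indicative, since text tokens and speech tokens are not directly comparable units:
\[
4 \times 10^{5}\ \mathrm{h} \times 3600\ \mathrm{s/h} \times 25\ \mathrm{tokens/s} = 3.6 \times 10^{10}\ \mathrm{tokens},
\qquad
\frac{1.5 \times 10^{13}}{3.6 \times 10^{10}} \approx 417 .
\]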

A straightforward way to address this limitation is to synthesize speech from text pre-training corpora using TTS models. However, this approach faces three major challenges. First, the lower information density of speech tokens leads to significant token expansion, drastically reducing training efficiency. Second, synthesizing speech for large-scale text corpora is computationally expensive. Third, training on pure speech data does not align speech with the text modality, preventing the model from leveraging the capabilities of existing LLMs. Recently, \citet{spiritlm} have explored the use of \textit{interleaved speech-text data} for training. This approach improves alignment between the speech and text modalities, leading to better speech language modeling performance. However, their method requires parallel speech-text datasets to construct the interleaved data, which significantly limits its scalability for large-scale pre-training.

In this paper, we propose a novel approach to scaling speech-text pre-training by synthesizing interleaved speech-text data from text corpora. The interleaved data is generated by sampling text spans and converting them into speech tokens using a text-to-token model. This efficient process bypasses the need to generate actual speech, enabling large-scale pre-training without relying on extensive speech datasets.
Inspired by \citet{cosyvoice}, we train the speech tokenizer in a supervised manner using ASR models and datasets. Experiments with sampling rates from 6.25Hz to 50Hz reveal trade-offs among semantic retention, model efficiency, speech reconstruction quality, and pre-training performance; we select 12.5Hz as the rate that best balances these factors.
To synthesize large-scale interleaved data, we use existing TTS datasets to train a text-to-token model, generate 600B tokens of interleaved speech-text data, and expand pre-training to 1 trillion tokens.
Finally, by fine-tuning on speech dialogue data, we develop an end-to-end spoken chatbot that operates entirely in the speech domain.
The main contributions of this paper are as follows:

\begin{itemize}[leftmargin=*,itemsep=0pt,parsep=0.2em,topsep=0.0em,partopsep=0.0em]
\item We propose a novel method to effectively synthesize high-quality interleaved speech-text data from text corpora, addressing data limitation challenges in speech-text pre-training.
\item We design a SpeechLM architecture featuring a 12.5Hz single-codebook speech tokenizer trained in a supervised manner, along with a flow-matching based decoder for speech reconstruction, achieving both robust semantic preservation and high-quality speech synthesis.
\item We scale our pre-training to 1 trillion tokens using synthesized interleaved speech-text data, significantly advancing capabilities in speech language modeling and spoken question answering.
\item We develop an end-to-end spoken chatbot by fine-tuning pre-trained models with speech dialogue data, achieving competitive performance in conversational abilities and speech quality while operating exclusively in the speech domain.
\end{itemize}
