\section{Experiments}

\input{tables/pretrain_results}

\subsection{Experimental Setup}

\paragraph{Configuration} We employ \texttt{GLM-4-9B-Base}~\citep{chatglm} as our base LLM for experiments. For ablation, we also use a smaller LLM with 1.5 billion parameters detailed in~\Cref{tab:model_architecture}. Our speech-text pre-training stage processes a total of 1T tokens, with a fixed sampling of 30\% text data, one epoch each of unsupervised speech and supervised speech-text data, and the remainder consisting of interleaved data. Throughout the pre-training stage, we maintain a sequence length of 8192 tokens and use a learning rate that linearly decays from 6e-5 to 6e-6. For the fine-tuning phase, we use a batch size of 64, a sequence length of 4096 tokens, and train for 10 epochs on the fine-tuning dataset with a learning rate decaying from 5e-5 to 5e-6. We use the AdamW optimizer for both pre-training and fine-tuning stages.
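As a sketch of the pre-training schedule, assume the decay spans all $T$ training steps with no warmup phase; warmup is not specified above, and $\eta_t$, $t$, and $T$ are notation introduced here for illustration. The learning rate at step $t$ is then
\[
\eta_t = 6\times10^{-5} - \left(6\times10^{-5} - 6\times10^{-6}\right)\frac{t}{T}, \qquad 0 \le t \le T,
\]
so that $\eta_0 = 6\times10^{-5}$ and $\eta_T = 6\times10^{-6}$. If the fine-tuning decay is also linear, it has the same form with endpoints $5\times10^{-5}$ and $5\times10^{-6}$.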
\vspace{-0.5em}
\paragraph{Baselines}
For pre-trained models, we compare our method with GSLM~\citep{GSLM}, AudioLM~\citep{AudioLM}, TWIST~\citep{TWIST}, SpiRit-LM~\citep{spiritlm}, SpeechGPT~\citep{speechgpt}, Spectron~\citep{Spectron}, and Moshi~\citep{moshi}. Except for GSLM and AudioLM, all baselines are built on a text-pretrained language model. Note that Moshi is pretrained on a speech collection of 7 million hours, an order of magnitude larger than our unsupervised speech data. For chat, we compare only against end-to-end spoken chatbots that support speech as both input and output: SpeechGPT~\citep{speechgpt}, Llama-Omni~\citep{llama-omni}, Mini-Omni~\citep{mini-omni}, and Moshi~\citep{moshi}.
Moshi is fine-tuned for full-duplex conversations, and each conversation must begin with a greeting from the model. We therefore wait 3 seconds for the greeting to end before asking the speech query. For Mini-Omni, we use the default AT mode for evaluation.
\vspace{-0.5em}
\paragraph{Speech Language Modeling}
We first evaluate the pretrained model's ability to model speech by the accuracy of selecting the correct continuation of a given context according to the predicted likelihood. We consider three different settings: from speech context to speech continuation (denoted as `S'), from text context to speech continuation (denoted as `T$\rightarrow$S'), and from speech context to text continuation (denoted as `S$\rightarrow$T'). We use two datasets proposed by \citet{TWIST}, Spoken StoryCloze and Spoken TopicStoryCloze. The baseline results are taken from \citet{spiritlm,moshi}.
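One way to write this criterion, as a sketch: let $x$ be the context, $c^{+}$ the correct continuation, and $c^{-}$ the incorrect one (notation introduced here for illustration). An example counts as correct when
\[
\log p_\theta(c^{+} \mid x) > \log p_\theta(c^{-} \mid x), \qquad \log p_\theta(c \mid x) = \sum_{i=1}^{|c|} \log p_\theta(c_i \mid x, c_{<i}),
\]
and accuracy is the fraction of examples satisfying this inequality. Whether the log-likelihood is additionally normalized by the continuation length $|c|$ is not specified here, so the unnormalized form above is an assumption.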
\vspace{-0.5em}
\paragraph{Spoken Question Answering}
Similar to closed-book question answering in NLP, spoken question answering requires the speech-language model to answer spoken questions about broad factual knowledge without access to external knowledge. We consider two settings for the model: from spoken questions to spoken answers (denoted as `S'), and from spoken questions to textual answers (denoted as `S$\rightarrow$T'). We evaluate our model on the three datasets used in \citet{moshi}: Web Questions~\citep{webquestions}, Llama Questions~\citep{Spectron}, and TriviaQA~\citep{TriviaQA}. The baseline results are taken from \citet{moshi}.
\vspace{-0.5em}
\paragraph{Evaluating Spoken Chatbots} We evaluate the spoken chatbot's capabilities along two aspects: general question answering and knowledge. For general question answering, we use prompts from the \texttt{helpful\_base} and \texttt{vicuna} categories of AlpacaEval~\citep{alpaca_eval}, which are more suitable for voice interactions. For the knowledge assessment, we draw 100 questions from the Web Questions, Llama Questions, and TriviaQA datasets. The generated speech is transcribed into text using \texttt{whisper-large-v3}, and GPT-4 scores the responses on a scale of 1 to 10, with the detailed prompt provided in~\Cref{app:prompt-for-evaluation}. Additionally, following~\citet{llama-omni}, we measure ASR-WER to assess the alignment between generated speech and text, and UTMOS~\citep{utmos} to evaluate overall speech quality.
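For reference, ASR-WER follows the standard word error rate definition. Assuming the generated text response serves as the reference and the ASR transcript of the generated speech as the hypothesis (our reading of the setup, not stated explicitly above),
\[
\mathrm{WER} = \frac{S + D + I}{N},
\]
where $S$, $D$, and $I$ are the numbers of substituted, deleted, and inserted words in the minimum-edit-distance alignment, and $N$ is the number of words in the reference. A lower value indicates closer agreement between the generated speech and text.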
\vspace{-1em}
\subsection{Main Results}
The evaluation results for the pretrained model are shown in \Cref{tab:pre-training}. On speech language modeling, our method outperforms the baselines on all tasks except the `S' setting of Spoken TopicStoryCloze, on which our model achieves accuracy comparable to SpiRit-LM and Moshi. Compared with SpiRit-LM, our method achieves significant improvements on the `T$\rightarrow$S' and `S$\rightarrow$T' settings, indicating that our synthetic interleaved data effectively aligns the text and speech modalities. On spoken question answering, our method significantly outperforms all the baselines on both the `S' and `S$\rightarrow$T' settings of all three datasets. The improvements are especially substantial in the speech-to-speech setting. On Llama Questions, our method considerably reduces the previous gap between the speech-to-speech and speech-to-text settings, indicating that it effectively transfers knowledge from the text modality to the speech modality. Overall, our method outperforms the best baseline, Moshi, with only a tenth of Moshi's natural speech data.

\input{tables/chat_results}

\Cref{tab:alignment} shows the evaluation results for spoken chatbots. Our 9B text-guided model outperforms all baseline models in general question-answering and knowledge-based tasks. It also achieves better results in speech quality evaluation compared to others. Notably, even without text guidance, the 9B model still performs comparably with text-guided baselines, highlighting our method's effectiveness in aligning text and speech modalities.

\vspace{-1em}
\subsection{Ablation Study}
\input{tables/pretrain_ablation}
\subsubsection{Data Scaling and Composition}

Our pre-training corpus consists of text, speech data, speech-text interleaved data, and speech-text parallel data (from ASR and TTS tasks). We study the effects of data scaling and composition.
First, we evaluate how scaling interleaved data impacts model performance. We train the 9B model with interleaved data sizes of 0, 100B, and 200B tokens, keeping the other parts of the pre-training corpus unchanged. \Cref{tab:pre-training-ablation} compares these results with the best model, trained on 600B tokens of interleaved data. Without interleaved data, the model performs poorly, but as interleaved data scales up, performance improves consistently. This demonstrates the effectiveness of scaling synthetic speech-text interleaved data. \Cref{fig:interleave_data_tokens} further shows that increasing interleaved data improves chatbot performance after supervised fine-tuning, both with and without text guidance.

Next, we analyze the contributions of different parts of the pre-training corpus using a 1.5B model. Results in \Cref{tab:pre-training-ablation} show that removing synthetic interleaved data significantly degrades performance. Removing unsupervised speech data slightly reduces spoken question answering accuracy, while removing text or speech-text parallel data improves performance on most benchmarks, likely due to capacity competition among modalities in smaller models. For the 9B models, we retain all data types, as they represent essential tasks for downstream applications and larger models alleviate this competition.

\vspace{-1em}
\subsubsection{Sampling Rate}
\label{sec:ablation-sample-rate}

The sampling rate of the speech tokenizer refers to the number of speech tokens generated per second. \citet{TWIST} observed that reducing HuBERT's sampling rate from 50Hz to 25Hz improved performance on speech language modeling tasks. We trained 1.5B models with tokenizers at different sampling rates using the same number of training tokens, excluding ASR and TTS datasets for simplicity, and analyzed the relationship between sampling rate and accuracy (\Cref{fig:sampling_rate}).
The results show that lower sampling rates improve average accuracy. We hypothesize two reasons: (1) lower sampling rates allow the model to process more speech data within the same training token budget, and (2) shorter token sequences for the same audio reduce modeling difficulty. We selected a 12.5Hz sampling rate for our main model, as the 6.25Hz tokenizer showed a trade-off where speech information loss outweighed accuracy gains.
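To make reason (1) concrete: a tokenizer with sampling rate $r$ (in Hz) emits $3600\,r$ tokens per hour of audio, so a budget of $B$ speech tokens covers
\[
H(r) = \frac{B}{3600\,r}
\]
hours of speech ($B$ and $H$ are notation introduced for this illustration). One hour of audio thus costs $180{,}000$ tokens at 50Hz, $45{,}000$ at 12.5Hz, and $22{,}500$ at 6.25Hz, so halving the sampling rate doubles the amount of audio seen under a fixed token budget.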

\vspace{-1em}
\subsubsection{Span Corruption Ratio}
\label{sec:ablation-span-corruption-ratio}

The span corruption ratio determines the proportions of text and speech tokens in interleaved samples. With extreme corruption ratios close to 0 or 1, the interleaved samples are dominated by text or speech tokens. To study the effect of the ratio and determine the best value, we train multiple 1.5B models with interleaved data generated at different span corruption ratios and plot the results in \Cref{fig:span_corruption_ratio}. Overall, we find that corruption ratios from 0.2 to 0.4 work well, while larger or smaller ratios result in a significant degradation of performance. Based on these results, we select 0.3 as the corruption ratio for our main model.
