\section{Training Datasets}
\label{app:training-datasets}
\subsection{Interleaved Speech-Text Data}

\input{tables/interleaved-breakdown}

\subsection{Supervised Speech Data}
\label{app:supervised-speech-data}
\input{tables/tts-breakdown}
\input{tables/asr-breakdown}

\subsection{Text-to-Token Model Data}
\label{app:text-to-token-model-data}
\input{tables/tts-llm-breakdown}

The data used to train the text-to-token model consists of two parts: the TTS data presented in~\Cref{tab:tts-breakdown} and the synthesized data of incomplete text spans generated by CosyVoice, as detailed in~\Cref{tab:tts-llm-breakdown-cosy-voice}.

\section{Training Details}
\subsection{Speech Tokenizer}
\label{appendix_sec:tokenizer_training}
We fine-tune the vector-quantized Whisper model with a collection of ASR datasets, including LibriSpeech~\citep{librispeech}, GigaSpeech~\citep{GigaSpeech}, MLS-Eng~\citep{MLS}, Wenet~\citep{wenet}, CommonVoice~\citep{commonvoice}, AISHELL-1~\citep{aishell1}, and a proprietary Chinese ASR dataset of 10k hours. We also include the unsupervised speech data with pseudo labels generated by Whisper~\citep{whisper} for English and FunASR~\citep{funasr} for Chinese.

All of our speech tokenizers in \Cref{tab:speech_tokenizer} are fine-tuned from \texttt{whisper-large-v3} for 2 epochs with a batch size of 4096 and a learning rate of 1e-5. The ratio of supervised samples to pseudo-labeled samples is 1:3. The codebook vectors are updated with an exponential moving average with decay coefficient 0.99, and the commitment loss coefficient is 10.0. To reduce the information loss caused by average pooling, we increase the codebook size as the sampling rate decreases.
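
For reference, a common formulation of the codebook update and the commitment loss is sketched below. It assumes the standard VQ-VAE recipe, and the exact update used in training may differ in detail; the symbols $h_i$, $e_k$, $q(i)$, $n_k$, $N_k$, $m_k$, $\gamma$, $\beta$, and $\mathrm{sg}[\cdot]$ are notation introduced here for illustration. Let $h_i$ denote a pooled encoder state, $e_k$ the $k$-th codebook vector, $q(i)=\arg\min_k \|h_i - e_k\|_2$ the index of the nearest codebook vector, and $n_k$ the number of states in the current batch assigned to $e_k$. With decay coefficient $\gamma = 0.99$, the update reads
\[
N_k \leftarrow \gamma N_k + (1-\gamma)\, n_k, \qquad
m_k \leftarrow \gamma m_k + (1-\gamma) \sum_{i\,:\,q(i)=k} h_i, \qquad
e_k \leftarrow \frac{m_k}{N_k}.
\]
Under the same assumption, the commitment term with coefficient $\beta = 10.0$ takes the form
\[
\mathcal{L}_{\mathrm{commit}} = \beta\, \bigl\| h_i - \mathrm{sg}[e_{q(i)}] \bigr\|_2^2,
\]
where $\mathrm{sg}[\cdot]$ is the stop-gradient operator, so that only the encoder states are pulled toward their assigned codebook vectors.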

\begin{table}[H]
    \centering
    \caption{\textbf{ASR results of Whisper models with pooling layers and vector quantization.} LibriSpeech (English) is evaluated by word error rate (WER) and AISHELL-1 (Chinese) by character error rate (CER). The first model is the original \texttt{whisper-large-v3} without fine-tuning.}
    \begin{tabular}{crrrrrr}
    \toprule
    Model & Sampling & VQ & Finetuned & \multicolumn{2}{c}{LibriSpeech} & AISHELL-1  \\
    & Rate & Codebook &  & test-clean & test-other & test\\
    \midrule
    whisper-large-v3 & 50Hz & - & No & 2.50 & 4.53 & 9.31 \\
    whisper-large-v3 & 50Hz & 4,096 & Yes & 1.85 & 3.78 & 2.70\\
    whisper-large-v3 & 25Hz & 4,096 & Yes & 1.94 & 4.16 & 2.86 \\
    whisper-large-v3 & 12.5Hz & 16,384 & Yes & 2.10 & 4.90 & 3.02 \\
    whisper-large-v3 & 6.25Hz & 65,536 & Yes & 2.48 & 6.34 & 3.33 \\
    \bottomrule
    \end{tabular}
    \label{tab:tokenizer_asr}
\end{table}

During the training of the speech tokenizers, we measure how much semantic information is retained using automatic speech recognition (ASR) accuracy. We evaluate the fine-tuned Whisper models on LibriSpeech~\citep{librispeech} and AISHELL-1~\citep{aishell1}, along with the original Whisper model. The results are shown in \Cref{tab:tokenizer_asr}. Overall, all the tokenizers preserve enough semantic information to achieve accurate ASR performance.
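
As a rough illustration of the trade-off between sampling rate and codebook size (our own arithmetic, not a measurement from the experiments), a token stream with rate $f$ and codebook size $K$ can carry at most $f \log_2 K$ bits per second. For the fine-tuned configurations in \Cref{tab:tokenizer_asr} this bound is
\[
\begin{array}{rcl}
50 \times \log_2 4096 &=& 600\ \mathrm{bits/s},\\
25 \times \log_2 4096 &=& 300\ \mathrm{bits/s},\\
12.5 \times \log_2 16384 &=& 175\ \mathrm{bits/s},\\
6.25 \times \log_2 65536 &=& 100\ \mathrm{bits/s}.
\end{array}
\]
The larger codebooks therefore only partially offset the lower sampling rates, which is consistent with the gradual increase in error rates from 50Hz to 6.25Hz.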


\begin{table}[H]
    \centering
    \caption{Ablation study on block sizes in the block causal attention of speech tokenizers.}
    \begin{tabular}{crrrrr}
    \toprule
    Model & Attention & Block & \multicolumn{2}{c}{LibriSpeech} & AISHELL-1  \\
    & Type & Size & test-clean & test-other & test\\
    \midrule
    whisper-large-v3 & Bidirectional & - & 3.45 & 5.82 & 4.71 \\
    whisper-large-v3 & Causal & - & 3.13 & 7.10 & 6.27\\
    whisper-large-v3 & Block Causal & 0.5s & 3.85 & 7.30 & 5.70 \\
    whisper-large-v3 & Block Causal & 1s & 3.37 & 6.50 & 5.14 \\
    whisper-large-v3 & Block Causal & 2s & 3.39 & 6.16 & 4.76 \\
    \bottomrule
    \end{tabular}
    \label{tab:tokenizer_block_size}
\end{table}

We also conduct an ablation study on the effect of block size in the block causal attention. In the ablation study, we fine-tune the Whisper model with only the supervised ASR datasets for 20,000 steps with a batch size of 1024. For all the models, we use a VQ codebook size of 4,096 and a sampling rate of 25Hz. The results are shown in \Cref{tab:tokenizer_block_size}.
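
One way to formalize block causal attention is the mask below. It is a sketch under the assumption that the sequence is split into consecutive non-overlapping blocks and that each position attends to its own block and to all earlier blocks; the symbols $T$, $f$, $B$, and $M_{ij}$ are notation introduced here for illustration. For a block duration $T$ and a frame rate $f$ at the attention layers, each block spans $B = T f$ frames, and with zero-based positions $i$ and $j$,
\[
M_{ij} = 1 \ \mathrm{if}\ \lfloor j / B \rfloor \le \lfloor i / B \rfloor, \qquad M_{ij} = 0 \ \mathrm{otherwise},
\]
where $M_{ij} = 1$ means that position $i$ may attend to position $j$. Under this definition, $B = 1$ recovers causal attention and a single block covering the whole sequence recovers bidirectional attention. For example, if the attention layers operate at $f = 50$Hz, a block of $T = 1$s corresponds to $B = 50$ frames.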

\subsection{Speech Decoder}
\label{sec:decoder_training}
The speech decoder uses the same architecture as the flow matching model in CosyVoice~\citep{cosyvoice}. The decoder is trained from scratch for 2 epochs with dynamic batching of 20,000 frames per batch and a learning rate of 1e-3. For simplicity, we remove the speaker embeddings from the flow model. The training datasets include Emilia~\citep{emilia}, Yodas2~\citep{yodas}, Libri-Light~\citep{librilight}, and a proprietary Chinese speech dataset.
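
For context, flow matching models of this kind are commonly trained with the optimal-transport conditional flow matching objective. The form below is a sketch under that assumption rather than a description of the exact loss used here; the symbols $x_0$, $x_1$, $x_t$, $c$, $v_\theta$, and $\sigma_{\min}$ are notation introduced for illustration. Given a noise sample $x_0 \sim \mathcal{N}(0, I)$, a target acoustic feature sequence $x_1$ (e.g., a Mel spectrogram), conditioning $c$ (here, the speech tokens), and a time step $t$ drawn uniformly from $[0,1]$,
\[
x_t = \bigl(1 - (1-\sigma_{\min})\,t\bigr)\, x_0 + t\, x_1, \qquad
\mathcal{L}_{\mathrm{CFM}} = \mathrm{E}_{t,\,x_0,\,x_1} \bigl\| v_\theta(x_t, t, c) - \bigl(x_1 - (1-\sigma_{\min})\, x_0\bigr) \bigr\|_2^2 ,
\]
where $v_\theta$ is the vector field predicted by the decoder and $\sigma_{\min}$ is a small constant.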

\subsection{Text-to-Token Model}
\label{app:text-to-token-model}
The text-to-token model is initialized from a 1.5B pre-trained text LM of the same architecture (further experiments indicate that training from scratch yields the same performance). The training dataset consists of the TTS corpora in~\Cref{tab:tts-breakdown} and~\Cref{tab:tts-llm-breakdown-cosy-voice}, with a sampling ratio proportional to the number of samples in each subset. We use the AdamW~\citep{adamw} optimizer with $\beta_1=0.9$ and $\beta_2=0.95$. The model is trained for 300B tokens with a sequence length of 4096, a batch size of 256, a learning rate that decays from $4\times10^{-4}$ to $4 \times 10^{-5}$, and a weight decay of $0.1$. The architecture of the text-to-token model is shown in \Cref{tab:model_architecture_ttsllm} (the number of speech tokens is not included in the vocab size).
\begin{table}[H]
    \centering
    \caption{Model architecture of the text-to-token language model.}
    \label{tab:model_architecture_ttsllm}
    \begin{tabular}{cr}
    \toprule
    \textbf{Hyper-parameters} & \textbf{Value} \\
    \midrule
    Number of layers & 28\\
    Hidden size & 2048 \\
    FFN inter hidden size & 6144\\
    Activation function & SwiGLU \\
    Attention heads & 16 \\
    Attention head size  & 128\\
    Attention group size & 4 \\
    Maximum sequence length & 8192 \\
    Vocab size & 59264 \\
    \bottomrule
    \end{tabular}
\end{table}
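
As a back-of-the-envelope check of the training budget stated above (our own arithmetic, assuming the batch size counts sequences and every sequence is packed to the full length), each optimizer step processes
\[
256 \times 4096 = 1{,}048{,}576 \approx 1.05 \times 10^{6}\ \mathrm{tokens},
\]
so that 300B tokens correspond to roughly
\[
\frac{3 \times 10^{11}}{1{,}048{,}576} \approx 2.86 \times 10^{5}\ \mathrm{steps}.
\]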

\subsection{Speech Language Model}
\begin{table}[H]
    \centering
    \caption{Model architecture of the speech language model.}
    \label{tab:model_architecture}
    \begin{tabular}{crr}
    \toprule
     & 9B & 1.5B \\
    \midrule
    Number of layers & 40 & 28\\
    Hidden size & 4096 & 2048 \\
    FFN inter hidden size & 13696 & 6144\\
    Activation function & SwiGLU & SwiGLU \\
    Attention heads & 32 & 16 \\
    Attention head size & 128 & 128\\
    Attention group size & 2 & 4 \\
    Maximum sequence length & 8192 & 8192 \\
    Vocab size & 151552 & 59264 \\
    \bottomrule
    \end{tabular}
\end{table}

\section{Evaluation Details}
For Spoken StoryCloze and Spoken TopicStoryCloze, we synthesize the speech for the contexts and continuations from the provided texts using the TTS engine. When selecting the most probable continuation, the likelihood is normalized by the number of tokens in each continuation.
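
One reading of this selection rule, assuming the normalization is applied to the log-likelihood (the notation $x$, $\mathcal{C}$, $c_t$, and $p_\theta$ is introduced here for illustration), is the following. Given the spoken context $x$ and the set of candidate continuations $\mathcal{C}$, the model selects
\[
\hat{c} = \arg\max_{c \in \mathcal{C}} \; \frac{1}{|c|} \sum_{t=1}^{|c|} \log p_\theta\bigl(c_t \mid x, c_{<t}\bigr),
\]
where $|c|$ is the number of tokens in continuation $c$. Without the factor $1/|c|$, longer continuations would be penalized simply for containing more tokens.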

For Llama Questions, we use the audio files provided in the dataset\footnote{\url{https://github.com/google-research-datasets/LLAMA1-Test-Set}}. We synthesize the speech for Web Questions and TriviaQA with the TTS engine. For TriviaQA, we randomly sample 1,000 examples from the test set of the `rc' setting to match the size of the other two datasets. For all three spoken question answering datasets, we add the text prompt ``the answer is'' after the question for both the `S' and `S$\rightarrow$T' settings. For the `S' setting the model generates speech of at most 10 seconds, and for the `S$\rightarrow$T' setting the model generates at most 128 tokens.

\section{Case Study}
\subsection{Spoken Question Answering}
Here we provide examples of spoken question answering from Llama Questions, Web Questions, and TriviaQA. Given a question in speech, the speech language model predicts speech tokens, which are then fed into the speech decoder to produce the output audio. We use \texttt{whisper-large-v3} to transcribe the audio into text.

Dataset: Llama Questions
\begin{tcolorbox}[left=0mm,right=0mm,top=0mm,bottom=0mm,boxsep=1mm,arc=0mm,boxrule=0pt, frame empty, breakable]
\small
\begin{lstlisting}
(*@\textbf{Question:}@*) What is the capital of France? The answer is
(*@\textbf{Ground Truth:}@*) Paris
(*@\textbf{Prediction:}@*) Paris. The capital of France is Paris. Paris is the capital of the, the, the, the,
\end{lstlisting}
\end{tcolorbox}

Dataset: Web Questions
\begin{tcolorbox}[left=0mm,right=0mm,top=0mm,bottom=0mm,boxsep=1mm,arc=0mm,boxrule=0pt, frame empty, breakable]
\small
\begin{lstlisting}
(*@\textbf{Question:}@*) Who is governor of ohio 2011? The answer is
(*@\textbf{Ground Truth:}@*) John Kasich
(*@\textbf{Prediction:}@*) John Kasich. What is the name of the governor? 2012. The answer. John Kasich. What is
\end{lstlisting}
\end{tcolorbox}

Dataset: TriviaQA
\begin{tcolorbox}[left=0mm,right=0mm,top=0mm,bottom=0mm,boxsep=1mm,arc=0mm,boxrule=0pt, frame empty, breakable]
\small
\begin{lstlisting}
(*@\textbf{Question:}@*) Who plays the role of Bubble in Absolutely Fabulous? The answer is
(*@\textbf{Ground Truth:}@*) Jane Horrocks
(*@\textbf{Prediction:}@*) Julia Steiner Julia Steiner is an American actor, singer and dancer. He S he S he
\end{lstlisting}
\end{tcolorbox}

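From the examples above, we observe that our model can answer questions about world knowledge: the first two predictions begin with the correct answer, whereas the TriviaQA prediction is incorrect.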


\subsection{Spoken Chatbot Evaluation}
Here we provide examples of spoken chatbot evaluation. The question and response are transcriptions of the speech input and output.

Task: General QA

\begin{tcolorbox}[left=0mm,right=0mm,top=0mm,bottom=0mm,boxsep=1mm,arc=0mm,boxrule=0pt, frame empty, breakable]
\small
\begin{lstlisting}
(*@\textbf{Question:}@*) I need to get my nephew a birthday gift. I'm thinking of giving him a game for his Nintendo Switch. Can you give me a few options that came out recently?
(*@\textbf{Response:}@*) What a great nephew you are. Giving a game for his Nintendo Switch is a fantastic idea. Here are some amazing options that just released. Super Mario Odyssey, The Legend of Zelda Tears of the Kingdom, Bookman Sword and Shield, Just Dance, Minecraft, The Elder Scrolls Five Skyrim Anniversary Edition, Portal Knights, Shovel Knight, Celeste, and Overcooked two. These games are highly rated and perfect for any Nintendo Switch player. Choose one that matches his interests and enjoy his birthday.
(*@\textbf{Score:}@*) 3
\end{lstlisting}
\end{tcolorbox}

Task: Knowledge

\begin{tcolorbox}[left=0mm,right=0mm,top=0mm,bottom=0mm,boxsep=1mm,arc=0mm,boxrule=0pt, frame empty, breakable]
\small
\begin{lstlisting}
(*@\textbf{Question:}@*) How many days does Hanukkah last?
(*@\textbf{Ground Truth:}@*) Eight
(*@\textbf{Response:}@*)  Hanukkah, also known as the Festival of Lights, is an eight-day Jewish holiday that commemorates the rededication of the Temple in Jerusalem during the Maccabean Revolt. It typically falls in November or December, depending on the Hebrew lunar calendar, the festival is observed for eight days and nights, with each day corresponding to one of the eight brands of the menorah. Traditional activities include lighting the menorah, being frayed foods like lakes and sufkat, exchanging gift, playing dreidel and reciting the Hanukkah prayers. The lighting of the menorah is a central ritual. With the first candle lit on the earth's day, additional candles added each day. A festival includes on the eighth day with the lighting of the final candle and the sabration of Shemini Atzeret and Simkat Torah. Some cultures celebrate Hanukkah for Nidaz, including the Ishan al-Dev, Shemini Atzeret and Simkat Torah.
(*@\textbf{Judgment:}@*) Correct
\end{lstlisting}
\end{tcolorbox}


\subsection{Interleaved Pre-training Data}
\label{app:sample-interleaved}
Here we provide examples of speech-text interleaved pre-training data. The following examples are sampled from the FineWeb interleaved pre-training dataset with the 150k text tokenizer and the 12.5Hz speech tokenizer. The speech tokens are transcribed by the ASR model corresponding to the speech tokenizer and are displayed in \textcolor{blue}{blue}, and the special tokens are displayed in \textbf{bold}. An extra new line is added before and after each audio segment for clarity.

\begin{tcolorbox}[left=0mm,right=0mm,top=0mm,bottom=0mm,boxsep=1mm,arc=0mm,boxrule=0pt, frame empty, breakable]
\small
\begin{lstlisting}
Eugene Van Reed
- Died: 1873

(*@\textbf{<|begin\_of\_audio|>}@*) (*@\textcolor{blue}{[53 speech tokens] originally from san francisco, van reed first traveled to japan in}@*) (*@\textbf{<|end\_of\_audio|>}@*)
 1859, where he established his own trading company, dealing in arms among other goods. Named Consul General of
(*@\textbf{<|begin\_of\_audio|>}@*) (*@\textcolor{blue}{[34 speech tokens] the hawaiian kingdom in eighteen sixty six}@*) (*@\textbf{<|end\_of\_audio|>}@*)
 he played a key role in
(*@\textbf{<|begin\_of\_audio|>}@*) (*@\textcolor{blue}{[43 speech tokens] organizing for the first japanese immigrants to travel to hawaii in}@*) (*@\textbf{<|end\_of\_audio|>}@*)
 1868. This first group of 148, known as the gannenmono, encountered severely harsh working conditions on the sugar plantations, leading to considerable tension between the governments of Japan, Hawaii,
(*@\textbf{<|begin\_of\_audio|>}@*) (*@\textcolor{blue}{[40 speech tokens] and the united states, resulting in official japanese,}@*) (*@\textbf{<|end\_of\_audio|>}@*)
 immigration to
(*@\textbf{<|begin\_of\_audio|>}@*) (*@\textcolor{blue}{[71 speech tokens] hawaii at not beginning until 1885, after lengthy negotiations}@*) (*@\textbf{<|end\_of\_audio|>}@*)
.
 Van  Reed  died  in   187 3 ,  aboard  a  ship  called  Japan ,  which  he  had  been  taking  home  to  San  Francisco  from  Japan .
\end{lstlisting}
\end{tcolorbox}

\begin{tcolorbox}[left=0mm,right=0mm,top=0mm,bottom=0mm,boxsep=1mm,arc=0mm,boxrule=0pt, frame empty, breakable]
\small
\begin{lstlisting}
According to declarations from the sector Chambers, Argentina consumed 340 million litres of pesticides and herbicides in the last year; and this quantity is increasing 15% to 20% each year. These poisons are sprayed, fumigated and applied to areas inhabited by 12 million people. For
(*@\textbf{<|begin\_of\_audio|>}@*) (*@\textcolor{blue}{[20 speech tokens] a long time the residents}@*) (*@\textbf{<|end\_of\_audio|>}@*)
 of the affected localities have been denouncing to suffer from serious diseases as a consequence of their being contaminated by
(*@\textbf{<|begin\_of\_audio|>}@*) (*@\textcolor{blue}{[52 speech tokens] the pestatites this situation was confirmed at the first}@*) (*@\textbf{<|end\_of\_audio|>}@*)
 and 2nd Meeting
(*@\textbf{<|begin\_of\_audio|>}@*) (*@\textcolor{blue}{[73 speech tokens] of physicians of the fumigated towns, at the cordoba medical cns of faculty, and,}@*) (*@\textbf{<|end\_of\_audio|>}@*)
 Medical Sciences Faculty of the Rosario National University , in 2010 and 2011, respectively.
There is
(*@\textbf{<|begin\_of\_audio|>}@*) (*@\textcolor{blue}{[62 speech tokens] substantial public demand to reclassify pesticides in arginina. this demand}@*) (*@\textbf{<|end\_of\_audio|>}@*)
 is sound: depending on how pesticides
(*@\textbf{<|begin\_of\_audio|>}@*) (*@\textcolor{blue}{[77 speech tokens] are classified the provincial and municipal regulations determine the distances between fumigated}@*) (*@\textbf{<|end\_of\_audio|>}@*)
 (sprayed) and inhabited areas.
Currently, the classification is made according to the quantity in milligrams of poison that, when fed to rats, kills 50% of the population tested (Lethal Dose test or LD50); the less the quantity of poisonous substance is required,
(*@\textbf{<|begin\_of\_audio|>}@*) (*@\textcolor{blue}{[34 speech tokens] the higher the level of toxicity is attributed}@*) (*@\textbf{<|end\_of\_audio|>}@*)
 to that substance. As such, this form of measurement ignores medium and long term effects, including oncogenic, reproductive
(*@\textbf{<|begin\_of\_audio|>}@*) (*@\textcolor{blue}{[43 speech tokens] immunological and endocrine ones in the light of}@*) (*@\textbf{<|end\_of\_audio|>}@*)
 these facts,
(*@\textbf{<|begin\_of\_audio|>}@*) (*@\textcolor{blue}{[38 speech tokens] glyphosate should be reclassified as level ib}@*) (*@\textbf{<|end\_of\_audio|>}@*)
 (highly hazardous; the WHO recommended classification of pesticides by hazard), particularly because of the scientific and epidemiological data, showing that its accumulation in the body is connected to
(*@\textbf{<|begin\_of\_audio|>}@*) (*@\textcolor{blue}{[49 speech tokens] continental malformations and spontaneous abortions 1}@*) (*@\textbf{<|end\_of\_audio|>}@*)
 3].
Furthermore, the current toxicological classification of acute effects of pesticides
(*@\textbf{<|begin\_of\_audio|>}@*) (*@\textcolor{blue}{[24 speech tokens] done taking to account a}@*) (*@\textbf{<|end\_of\_audio|>}@*)
 new set of information and scientific data, which show the acute damage of these poisons for agricultural use on humans, and that are different from
(*@\textbf{<|begin\_of\_audio|>}@*) (*@\textcolor{blue}{[38 speech tokens] the findings in rodents highlighting patterns}@*) (*@\textbf{<|end\_of\_audio|>}@*)
 specific to humans. 
\end{lstlisting}
\end{tcolorbox}
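
The token counts shown in brackets can be converted to approximate durations, since a 12.5Hz tokenizer emits 12.5 tokens per second of audio (our own arithmetic, for illustration). A span of $n$ speech tokens lasts about $n / 12.5$ seconds, so for three of the spans above
\[
\frac{53}{12.5} = 4.24\ \mathrm{s}, \qquad \frac{20}{12.5} = 1.6\ \mathrm{s}, \qquad \frac{77}{12.5} = 6.16\ \mathrm{s}.
\]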


\section{Prompt for Constructing Speech Dialogue Dataset}
\label{app:speech-dialogue-rewrite-prompt-template}

\begin{tcolorbox}[left=0mm,right=0mm,top=0mm,bottom=0mm,boxsep=1mm,arc=0mm,boxrule=0pt, frame empty, breakable]
\small
\begin{lstlisting}
You are an AI assistant designed to convert text SFT data into SFT data adjusted for speech synthesis tasks. Your task is to generate a modified response suitable for text-to-speech synthesis under the following conditions:

- Exclusion of Unreadable Characters and Number Conversion: Remove any characters that text-to-speech (TTS) systems cannot synthesize, such as *, parentheses (), bullet points, or other special symbols. Convert all numbers into their English word equivalents. For example, convert one to one, twenty to twenty, and so on. Do not include lists or line breaks; the response should be a single paragraph.
- Specificity in Response: Make the response more specific and to the point, avoiding lengthy explanations. Focus on delivering the key message concisely.
- Clarity: Ensure that the response is clear and easy to understand when spoken aloud.
- Avoidance of Code Content: If the prompt suggests writing or generating code, return an empty JSON object. If the prompt only inquires about knowledge related to code without requiring code generation, provide a modified response.

Below is the the conversation input:

[Prompt]: {prompt}
[Response]: {response}

Please output in the following JSON format if the conditions are met: 

```json
{"response": "<modified_response>"}
```

If the prompt is filtered out, output: {}
\end{lstlisting}
\end{tcolorbox}

\section{Prompt Template for Spoken Dialogue}
\label{app:dialogue-prompt-template}

\begin{tcolorbox}[left=0mm,right=0mm,top=0mm,bottom=0mm,boxsep=1mm,arc=0mm,boxrule=0pt, frame empty, breakable]
\textbf{Direct Generation}
\small
\begin{lstlisting}
<|system|>
User will provide you with a speech instruction. Think about the instruction and speak the response aloud directly. 
<|user|>
<|begin_of_audio|>{Speech Instruction}<|end_of_audio|>
<|assistant|>
<|begin_of_audio|>{Speech Response}<|end_of_audio|>
\end{lstlisting}
\end{tcolorbox}

\begin{tcolorbox}[left=0mm,right=0mm,top=0mm,bottom=0mm,boxsep=1mm,arc=0mm,boxrule=0pt, frame empty, breakable]
\textbf{Text-guided Generation}
\small
\begin{lstlisting}
<|system|>
User will provide you with a speech instruction. Think about the instruction and speak the response aloud directly. 
<|user|>
<|begin_of_audio|>{Speech Instruction}<|end_of_audio|>
<|assistant|>transcript
{Text Response}
<|assistant|>
<|begin_of_audio|>{Speech Response}<|end_of_audio|>
\end{lstlisting}
\end{tcolorbox}

\section{Prompt for Evaluating Spoken Chatbots}
\label{app:prompt-for-evaluation}

\begin{tcolorbox}[left=0mm,right=0mm,top=0mm,bottom=0mm,boxsep=1mm,arc=0mm,boxrule=0pt, frame empty, breakable]
\textbf{General QA}
\small
\begin{lstlisting}
[Instruction]
Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".

[Question]
{instruction}

[The Start of Assistant's Answer]
{response}
[The End of Assistant's Answer]
\end{lstlisting}
\end{tcolorbox}

\begin{tcolorbox}[left=0mm,right=0mm,top=0mm,bottom=0mm,boxsep=1mm,arc=0mm,boxrule=0pt, frame empty, breakable]
\textbf{Knowledge}
\small
\begin{lstlisting}
Your will be given a question, the reference answers to that question, and an answer to be judged. Your tasks is to judge whether the answer to be judged is correct, given the question and reference answers. An answer considered correct expresses or contains the same meaning as at least **one of** the reference answers. The format and the tone of the response does not matter.

You should respond in JSON format. First provide a one-sentence concise analysis for the judgement in field `analysis`, then your judgment in field `judgment`. For example,
```json
{{"analysis": <a one-sentence concise analysis for the judgement>, "judgment": <your final judgment, "correct" or "incorrect">}}
```

# Question
{instruction}

# Reference Answer
{targets}

# Answer To Be Judged
{answer_to_be_judged}
\end{lstlisting}
\end{tcolorbox}


