\section{Our Approach}

\begin{figure}[t]
    \centering
    \includegraphics[width=\linewidth]{figures/arch-overview.pdf}
    \caption{\textbf{Overview of our method.} First, we train a text-to-token model to construct interleaved speech-text data. Training of the speech language model then proceeds in two stages. In stage 1, the model is pre-trained on synthetic interleaved speech-text data. In stage 2, the model is fine-tuned on a speech dialogue dataset.}
    \label{fig:arch-overview}
\end{figure}

% To date, there has been no end-to-end solution for a spoken chatbot that uses a single language model to directly process both speech input and output.
Current approaches for building SpeechLMs typically fall into two categories. One method~\citep{llama-omni, moshi} feeds speech input to the language model but has it output embeddings that an additional non-autoregressive (NAR) model turns into speech tokens, which limits the modeling capacity and potentially lowers the upper bound on performance. The other method~\citep{mini-omni} uses inconsistent audio representations for input and output, leading to misalignment between the input and output modalities.

In this section, we present our approach for developing an end-to-end spoken chatbot using a unified speech-text modeling framework. Our method integrates a supervised speech tokenizer, a technique for synthesizing interleaved speech-text data, and a two-stage training process to extend pre-trained language models to the speech domain. This comprehensive approach enables us to leverage large-scale text data for speech modeling, effectively aligning speech and text modalities within a single model.

\subsection{Speech Tokenization}
% \begin{wrapfigure}{r}{0.5\textwidth}
%     \centering
%     \vspace{-1em}
%     \includegraphics[width=0.5\textwidth]{figures/tokenizer-arch.pdf}
%     \label{fig:tokenizer-arch}
%     \vspace{-1em}
%     \caption{Speech Tokenizer and Speech Decoder Architecture.}
% \end{wrapfigure}
\paragraph{Supervised Speech Tokenizer}
Previous discrete speech tokenizers are trained either with reconstruction/adversarial objectives on the speech waveform~\citep{valle,valle2} or with self-supervised learning on automatically discovered acoustic units~\citep{hubert}. Following recent advances in text-to-speech synthesis~\citep{cosyvoice}, we train the discrete speech tokenizer by fine-tuning a pretrained automatic speech recognition (ASR) model with an additional pooling layer and a vector quantization layer in the middle of the encoder.
% Specifically, given an encoder-decoder ASR model (we use \texttt{whisper-large-v3} in the Whisper family~\citep{whisper}), whose encoder is a stack of $L$ Transformer layers, we keep the first $L/2$ layers unchanged,
% \begin{equation}
    % H^{(l)}=\text{Layer}^{(l)}(H^{(l-1)}),l=1,2,\cdots,L/2
% \end{equation}
% where $H^{(0)}$ is the sequence of raw audio representations. After the $L/2$-th layer we add a 1D average pooling operator $\mathrm{AvgPool}_k$ of window size $k$, which reduces the sampling rate to a fraction of $1/k$, and a vector quantization layer $\mathrm{VQ}(\cdot, E)$ with codebook $E\in R^{K\times D}$, where $K$ is the codebook size and $D$ is the hidden dimension,
% \begin{align*}
    % \bar{H}^{(L/2)}&=\mathrm{AvgPool}_k(H^{(L/2)}) \\
    % z^{(L/2)}&=\mathrm{VQ}(\bar{h}^{(L/2)}, E)=\argmin_{e\in E}\|\bar{h}^{(L/2)}-e\|_2\\
% \end{align*}
% We use $H$ and $Z$ to represent the sequence and $h$ and $z$ to represent a single vector in the sequence. The quantized vector sequence $Z^{(L/2)}$ is then passed through the remaining $L/2$ layers and the ASR decoder.

% The $\argmin$ operator in vector quantization is discrete and indifferentiable. Following \citet{vqvae}, we use the straight-through estimator to estimate the gradient of $\bar{h}^{(L/2)}$ with that of $z^{(L/2)}$, update the codebook with exponential moving average (EMA), and add a commitment loss $\|\mathrm{sg}(\mathrm{VQ}(\bar{h}^{(L/2)}, E))-\bar{h}^{(L/2)}\|_2^2$ to restrict the volume of $\bar{h}^{(L/2)}$. In practice, we find that the training process suffers from codebook collapse, wherein most vectors in the codebook are not used. Therefore we apply the random restart trick~\citep{jukebox} to reset the vectors whose mean usage falls below a threshold to vectors randomly selected from $\bar{H}^{(L/2)}$.
The pooling layer is a 1D average pooling operator of window size $k$, which reduces the sampling rate to a fraction of $1/k$. The vector quantization layer approximates the continuous intermediate representations in the encoder with the closest vectors in the codebook. The selected indices in the codebook are used as the speech token indices. The codebook vectors are learned with exponential moving average (EMA) and we add a commitment loss to restrict the volume of continuous representations before quantization. To overcome codebook collapse, we apply the random restart trick~\citep{jukebox} to reset vectors whose mean usage falls below a certain threshold.
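As an illustrative sketch of these operations (the symbols $h_t$, $\bar{h}_\tau$, $e_j$, $K$, and $\beta$ are notation introduced here, and the form follows the standard VQ-VAE recipe rather than an exact specification of the implementation), let $h_t$ denote the continuous encoder representation at position $t$ and $\{e_j\}_{j=1}^{K}$ the codebook vectors. Pooling, quantization, and the commitment loss can then be written as
\begin{align*}
\bar{h}_\tau &= \frac{1}{k}\sum_{i=1}^{k} h_{k(\tau-1)+i}, \\
z_\tau &= \operatorname*{arg\,min}_{j\in\{1,\dots,K\}} \|\bar{h}_\tau - e_j\|_2, \\
\mathcal{L}_{\mathrm{commit}} &= \beta \sum_{\tau} \|\bar{h}_\tau - \mathrm{sg}(e_{z_\tau})\|_2^2,
\end{align*}
where $z_\tau$ is the speech token index emitted for pooled position $\tau$, $\mathrm{sg}(\cdot)$ is the stop-gradient operator, and $\beta$ is an assumed weighting coefficient.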

We also adapt the Whisper architecture to support streaming inference, which is important for reducing latency in online speech interaction. We replace the convolution layer before the encoder Transformer with a causal convolution layer~\citep{wavenet}. We also replace the bidirectional attention in the encoder with block causal attention: the input audio is divided into segments of equal duration, and each position in a segment attends to all positions in the current segment and in previous segments, but not to positions in following segments. Empirically, we set the segment duration to 2 seconds (100 tokens before the average pooling). We find that this matches the ASR performance of bidirectional attention. For more details about speech tokenizer training, please refer to \Cref{appendix_sec:tokenizer_training}.
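The block causal pattern can be stated compactly as an attention mask. As a sketch, with zero-based position indices $i, j$ and a segment length of $B$ positions (both are notation introduced here for illustration), query position $i$ may attend to key position $j$ exactly when
\begin{equation*}
M_{ij} =
\begin{cases}
1, & \lfloor j/B \rfloor \le \lfloor i/B \rfloor, \\
0, & \text{otherwise.}
\end{cases}
\end{equation*}
With $B=100$ (2 seconds before pooling), position $i=150$ lies in segment $\lfloor 150/100 \rfloor = 1$ and therefore attends to positions $0$ through $199$, but not to any position $j \ge 200$.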

\input{tables/speech_tokenizer}

\paragraph{Speech Decoder} Given discrete speech tokens, we synthesize speech through the speech decoder. We follow the decoder architecture of CosyVoice~\citep{cosyvoice}, which consists of a speech token encoder, a conditional flow matching model~\citep{mehta2024matcha}, and a HiFi-GAN vocoder~\citep{hifi_gan}.
The speech token encoder converts a sequence of discrete tokens into a sequence of contextual vectors with a Transformer encoder. To facilitate the streaming synthesis of speech, we adapt the speech token encoder to use the same block causal attention as the speech tokenizer. The flow matching model generates Mel spectrograms conditioned on the speech token representations. Finally, the generated Mel spectrograms are converted into speech waveforms by the HiFi-GAN vocoder~\citep{hifi_gan}. To train the speech decoder, we use the unsupervised speech data described in \Cref{subsubsec:pretraining}, which covers a wide variety of speakers. For more details about speech decoder training, please refer to \Cref{sec:decoder_training}.
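For reference, the conditional flow matching objective with optimal-transport paths is commonly written as follows. This is a sketch of the standard formulation from the flow matching literature, with $x_0$, $x_1$, $\sigma_{\min}$, $v_\theta$, and $e$ as notation introduced here, and it omits any additional conditioning used in practice. Let $x_1$ be a target Mel spectrogram, $x_0\sim\mathcal{N}(0, I)$ a noise sample, $t$ a time drawn uniformly from $[0,1]$, and $e$ the speech token representations. Then
\begin{align*}
\phi_t(x_0, x_1) &= \bigl(1-(1-\sigma_{\min})t\bigr)\,x_0 + t\,x_1, \\
\mathcal{L}_{\mathrm{CFM}} &= \mathbb{E}_{t,x_0,x_1}\bigl\| v_\theta\bigl(\phi_t(x_0,x_1), t; e\bigr) - \bigl(x_1 - (1-\sigma_{\min})\,x_0\bigr) \bigr\|_2^2 .
\end{align*}
At synthesis time, a Mel spectrogram is obtained by integrating $\mathrm{d}x/\mathrm{d}t = v_\theta(x, t; e)$ from $t=0$ to $t=1$, starting from a noise sample.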
% To facilitate the streaming synthesis of speech from discrete speech tokens, we employ a block-wise encoder to model the representation of speech tokens. Block-wise encoder enables the speech decoder to synthesize audio in fixed block sizes. In practice, we set the block size to 2 seconds, the same as the speech tokenizer. For a 25Hz speech tokenizer, the speech decoder synthesizes audio after receiving 50 speech tokens.
% whereas for a 12.5Hz tokenizer, the decoder generates audio after every 25 tokens. 
% Specifically, give speech tokens $\mathbf{s}$, their representation $\mathbf{e}$ is calculated with a embedding layer $\mathrm{Emb}(\cdot)$ and a block-wise Transformer $\mathrm{BlockTransformer}(\cdot)$.
% \todo{The following formula seems strange in this paragraph}
% \begin{gather}
%     \mathbf{e}=\mathrm{BlockTransformer}\left ( \mathrm{Emb}\left ( \mathbf{s} \right ) \right ).
% \end{gather}
% In the block-wise Transformer, each token can only attend to tokens within the current block and those in preceding blocks, while information from subsequent blocks is masked out. Then, following \citet{cosyvoice} and \citet{mehta2024matcha}, we utilize an optimal-transport conditional flow matching model to learn the distribution of Mel spectrograms and generate samples from it conditioned on the speech token representation $\mathbf{e}$. Finally, the generated Mel spectrograms are passed through a HiFi-GAN vocoder~\citep{hifi_gan}, which converts them into the speech waveform.

% Following \citet{cosyvoice}, we utilize an optimal-transport conditional flow matching model to transform discrete speech tokens into Mel spectrograms through a denoising process along the optimal transport path. Continuous-time normalizing flows (CNFs) facilitate the establishment of a probability density path that transitions from a prior distribution $p_{0}\left( X \right)$ to the distribution of Mel spectrograms $q\left( X \right)$, which is defined by a time-dependent vector field $\upsilon \left ( X \right ):\left [ 0,1 \right ]\times \mathbb{R}^{N\times D}\to  \mathbb{R}^{N\times D}$. The flow $\phi_{t}$ is generated through the ordinary differential equation:
% \begin{gather}
%     \frac{d}{d_{t}}\phi _{t}\left ( X \right )=\upsilon \left ( \phi _{t}\left ( X \right ),t \right ), \;\;\text{where}\;\phi _{0}\left ( X \right )\sim p_{0}\left ( X \right )=\mathcal{N}\left ( X;0,I \right ),\phi _{1}\left ( X \right )\sim p_{1}\left ( X \right ).
% \end{gather}

% To estimate the conditional probabilities of Mel spectrograms generated from speech tokens, we use the optimal transport (OT) flow and the vector field is parameterized as a neural network with weights $\theta$, taking representation of speech tokens $\mathbf{e}$ and the masked Mel spectrogram $\tilde{X}_{1}$ as inputs:
% \begin{gather}
%     \upsilon \left ( \phi _{t}^{\mathrm{OT}}\left ( X_{0},X_{1} \right )\mid \theta  \right )=\mathrm{NN}_{\theta }\left ( \phi _{t}^{\mathrm{OT}}\left ( X_{0},X_{1} \right ),t;\mathbf{e},\tilde{X}_{1}  \right ),
% \end{gather}
% where $\tilde{X}_{1}$ is a masked version of $X_{1}$ by setting continuous frames to zeros from a random start point to the end and $\mathbf{e}$ is representation of speech tokens $\mathbf{s}$. To facilitate the streaming synthesis of speech from discrete speech tokens, we employ a block-wise encoding of speech tokens. This enables the speech decoder to synthesize audio in fixed block sizes. For instance, with a block size of 50, the speech decoder synthesizes audio each time it receives 50 speech tokens. Specifically, give speech tokens $\mathbf{s}$, their representation $\mathbf{e}$ is calculated with a embedding layer $\mathrm{Emb}(\cdot)$ and a block-wise Transformer $\mathrm{BlockTransformer}(\cdot)$:
% \begin{gather}
%     \mathbf{e}=\mathrm{BlockTransformer}\left ( \mathrm{Emb}\left ( \mathbf{s} \right ) \right ).
% \end{gather}
% In a block-wise Transformer, each token can only attend to tokens within the current block and those in preceding blocks, while information from subsequent blocks is masked out. Finally, to learn the vector field, optimal-transport conditional flow matching model is optimized by minimizing the following loss:
% \begin{gather}
%     \mathcal{L}_\mathrm{OT-CFM}=\mathbb{E}_{t,p_{0}\left ( X_{0} \right ),q\left ( X_{1} \right )}\left|\omega_{t}^{\mathrm{OT}}\left ( \phi_{t}^{\mathrm{OT}}\left ( X_{0},X_{1} \right )\mid X_{1}   \right )-\upsilon \left ( \phi _{t}^{\mathrm{OT}}\left ( X_{0},X_{1} \right )\mid \theta  \right ) \right|.
% \end{gather}
% Follow previous speech synthesis methods \citep{cosyvoice}, we also utilize a cosine scheduler for the timestep and Classifier-Free Guidance (CFG) to enhance the performance and stability of the speech decoder.

% $\times N_{up}$,$\times N_{mid}$,$\times N_{down}$,$\times N_{enc}$ $X_{t}$, $\tilde{X}$

We evaluate the content preservation and quality of speech generated by our speech decoder on LibriSpeech~\citep{librispeech}. The results are shown in \Cref{tab:speech_tokenizer}. We measure content preservation by the word error rate (WER) between the transcription produced by the ASR model provided in \citet{Expresso} and the ground-truth transcription. For speech quality, following~\citet{moshi}, we compute VisQOL~\citep{visqol} and MOSNet~\citep{MOSNet} scores of the reconstructed speech. Our tokenizer performs well across various sampling rates, with the 12.5Hz variant offering an optimal balance between efficiency and quality. It maintains high quality scores (MOSNet 3.39) and content preservation (WER 8.43) while significantly reducing bitrate (52.7).
Our ablation study on sampling rates during pre-training (cf.~\Cref{sec:ablation-sample-rate}) shows that lower rates improve performance, but gains plateau at 12.5Hz.
Based on these results, we select the 12.5Hz variant for our subsequent experiments.
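The relation between sampling rate and pooling window can be checked with a short calculation; the symbols $f_{\mathrm{enc}}$ and $f_{\mathrm{tok}}$ are notation introduced here. Since a 2-second segment contains 100 encoder positions before pooling, the pre-pooling rate is $f_{\mathrm{enc}} = 100/2 = 50$\,Hz, and a pooling window of size $k$ gives
\begin{equation*}
f_{\mathrm{tok}} = \frac{f_{\mathrm{enc}}}{k}, \qquad k = \frac{50\,\mathrm{Hz}}{12.5\,\mathrm{Hz}} = 4 .
\end{equation*}
At 12.5Hz, a 2-second segment therefore corresponds to $12.5 \times 2 = 25$ speech tokens.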

\subsection{Synthesizing Interleaved Speech-Text Data}
\label{sec:interleaved-data}
Interleaved speech-text data consists of tokens where speech and text sequences are interleaved at the word level. For example, a sequence might look like: ``Today is \texttt{<Speech\_24>} \texttt{<Speech\_5>} ... \texttt{<Speech\_128>} day''.
We hypothesize that training on interleaved speech-text data encourages the model to learn an alignment between speech and text, facilitating the transfer of text-based knowledge to speech representations.
Previous methods for creating interleaved speech-text data rely on aligned speech-text parallel datasets~\citep{spiritlm}, which are challenging to obtain.
We propose a novel and efficient approach for constructing interleaved speech-text data using existing text datasets. The process consists of two main steps. First, we train a text-to-token model that directly converts text into corresponding speech tokens, eliminating the need to synthesize actual speech. This approach avoids the error accumulation associated with text-to-speech-to-token pipelines and significantly improves synthesis efficiency, making it practical and scalable for large-scale data generation. Next, we sample text spans from existing text datasets and transform them into speech spans using the trained text-to-token model. This enables the efficient and scalable creation of interleaved speech-text data without requiring aligned speech-text parallel datasets.

\paragraph{Text-to-Token Model} We train a 1.5B-parameter text-to-token model based on the standard Transformer architecture to convert text into corresponding speech tokens. While these tokens can be further synthesized into actual speech using our speech decoder, this step is unnecessary for constructing interleaved speech-text data.
To prepare the training data, we first tokenize speech from text-to-speech datasets into discrete speech tokens. The text-to-token model is then trained to predict these speech token sequences based on the input text. The training objective is to minimize the negative log-likelihood of the predicted speech tokens conditioned on the corresponding text input:
\begin{align}
\mathcal{L} = -\sum_{i=1}^{N} \sum_{j=1}^{M_i} \log P(a_{i,j}|T_i, a_{i,<j}; \theta)
\end{align}
\input{tables/ttsllm_wer}
where $T_i$ is the $i$-th input text, $a_{i,j}$ is the $j$-th speech token in the $i$-th sample, $M_i$ is the length of the $i$-th speech token sequence, $\theta$ represents the model parameters, and $N$ is the number of training samples.

We use multi-speaker text-to-speech datasets to train this model (see~\Cref{app:text-to-token-model-data} for the detailed data distribution). We also include additional high-quality text-speech pairs generated using the CosyVoice~\citep{cosyvoice} model to improve accuracy for short or incomplete text spans.
The architecture and training details of the text-to-token model can be found in~\Cref{app:text-to-token-model}.
To speed up speech token generation, we deploy the model using the SGLang framework~\citep{zheng2024sglangefficientexecutionstructured}, achieving a generation speed of 25k tokens per second on a single H800 instance.
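To put this throughput in perspective, a rough calculation is useful. It assumes that the reported 25k tokens per second counts generated speech tokens and that these are produced at the 12.5Hz rate selected above:
\begin{equation*}
\frac{25{,}000\ \text{tokens/s}}{12.5\ \text{tokens per second of speech}} = 2{,}000\ \text{seconds of speech per wall-clock second},
\end{equation*}
that is, roughly 33 minutes of speech-equivalent tokens per second of generation under these assumptions.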

\paragraph{Interleaved Data Construction}
To construct interleaved data from a text document, we apply a span corruption technique that randomly selects spans from the text sequence. Span lengths are drawn repeatedly from a Poisson distribution ($\lambda=10$) until the total length of the selected spans reaches the predefined ratio $\eta$ of the original text length.
Next, text spans corresponding to the drawn lengths are randomly selected from the document. These spans are converted into speech tokens using the text-to-token model, producing an interleaved speech-text sequence.
The span corruption ratio $\eta$ plays a crucial role in enabling effective knowledge transfer between the speech and text modalities, as demonstrated in our ablation study (cf.~\Cref{sec:ablation-span-corruption-ratio}). Based on the findings of this study, we set $\eta$ to 0.3 for optimal performance.
We select high-quality text datasets (FineWeb-Edu~\citep{fineweb} for English and Chinese-Fineweb-Edu~\citep{chinese-fineweb-edu} for Chinese) and apply the synthesis process described above, generating a total of 600B tokens with a 2:1 ratio of English to Chinese. \Cref{app:sample-interleaved} provides samples of the constructed interleaved data.
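The span sampling step can be sketched more formally as follows; the symbols $n$, $\ell_m$, and $m^\ast$ are notation introduced here, and the unit of length (words or tokens) is left unspecified. For a document of length $n$, span lengths are drawn independently and sampling stops at the first index at which the budget is met:
\begin{equation*}
\ell_1, \ell_2, \ldots \sim \mathrm{Poisson}(\lambda), \qquad m^\ast = \min\Bigl\{ m : \sum_{r=1}^{m} \ell_r \ge \eta\, n \Bigr\}.
\end{equation*}
For example, with $n = 1000$, $\eta = 0.3$, and $\lambda = 10$, the budget is $\eta n = 300$ units, so about $300/10 = 30$ spans are drawn on average.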

We evaluated the performance of the text-to-token model on the VCTK~\citep{vctk} dataset and on the interleaved data, using word error rate (WER) as the evaluation metric. To compute WER, we used our speech decoder to synthesize real speech from the speech tokens generated by the text-to-token model, and then transcribed the synthesized speech using \texttt{whisper-large-v3}~\citep{whisper}. The results are summarized in Table~\ref{tab:ttsllm_wer}.
For the standard VCTK dataset, we observed a lower WER of 3.20. However, the WER for speech spans generated from the text pre-training data was higher. We attribute this discrepancy primarily to some spans in the text pre-training data being difficult to pronounce accurately.
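For reference, WER follows the standard definition based on the word-level edit distance between the ASR transcription and the reference text, where $S$, $D$, and $I$ are the numbers of substituted, deleted, and inserted words and $N_{\mathrm{ref}}$ is the number of words in the reference (notation introduced here):
\begin{equation*}
\mathrm{WER} = \frac{S + D + I}{N_{\mathrm{ref}}} .
\end{equation*}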

\subsection{Training}
We initialize the speech language model with the parameters of a pre-trained large language model to leverage its existing knowledge. To support speech processing, we extend the model's vocabulary and embedding space. Specifically, we augment the original language model vocabulary, $V_{\mathrm{lang}}$, with a discrete speech vocabulary, $V_{\mathrm{speech}}$, resulting in a combined vocabulary, $V = V_{\mathrm{lang}} \cup V_{\mathrm{speech}}$. This expansion allows the model to accept both text and speech tokens as input and output. However, the ability to effectively understand and generate these tokens relies on subsequent training to align the text and speech modalities.
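One simple way to realize this extension, given here as an assumption for illustration rather than a description of the exact implementation, is to append the speech codes after the text vocabulary. A speech code $s \in \{0, \ldots, |V_{\mathrm{speech}}|-1\}$ is assigned the token id $|V_{\mathrm{lang}}| + s$, and the embedding matrix is extended by stacking newly initialized rows below the pre-trained ones:
\begin{equation*}
|V| = |V_{\mathrm{lang}}| + |V_{\mathrm{speech}}|, \qquad
E = \begin{bmatrix} E_{\mathrm{lang}} \\ E_{\mathrm{speech}} \end{bmatrix},
\end{equation*}
where $E_{\mathrm{lang}}$ holds the pre-trained text embeddings and $E_{\mathrm{speech}}$ holds the new speech token embeddings (both symbols are introduced here).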

The training process consists of two stages. In the first stage, the model is pre-trained on synthetic interleaved data to learn the alignment between text and speech. In the second stage, fine-tuning is performed using speech dialogue data to enable the model to handle speech interactions.

% \input{tables/pretrain_data}

% \subsubsection{Speech-Text Pre-training}
% To expand Large Language Models' (LLMs) capabilities in speech-text modeling, we introduce a speech-text pre-training stage. This stage enables the model to effectively process and represent discrete speech tokens. Our pre-training process utilizes four data types, each serving a specific purpose:

% \begin{itemize}[leftmargin=*,itemsep=0pt,parsep=0.2em,topsep=0.0em,partopsep=0.0em]
% \item \textbf{Unsupervised text data:} We employ the same diverse corpus as \citet{chatglm}, comprising approximately 10T tokens. This extensive dataset encompasses a wide range of sources, including webpages, Wikipedia articles, books, code repositories, and research papers, maintaining and enhancing the LLM's existing language understanding capabilities.
% \item \textbf{Unsupervised speech data:} To improve modeling of diverse speech styles, we utilize the Emilia pipeline \citep{emilia}. This process includes audio standardization, source separation, speaker diarization, fine-grained VAD segmentation, language identification, and quality assessment using DNSMOS P.835 \citep{DNSMOS}. We retain only English and Chinese speech sequences with average DNSMOS P.835 scores exceeding 2.75, resulting in a 700k-hour high-quality speech collection.
% \item \textbf{Interleaved speech-text data:} To promote cross-modal understanding, we align text and speech modalities. For English, we synthesize speech from FineWeb-Edu \citep{fineweb}, an educationally focused subset of FineWeb containing 1.3T text tokens. The Chinese counterpart, Chinese-Fineweb-edu\footnote{\url{https://huggingface.co/datasets/opencsg/chinese-fineweb-edu}}, serves a similar purpose. These high-quality datasets facilitate the transfer of knowledge between textual and speech domains. \todo{Need improve}
% \item \textbf{Supervised speech-text data:} This category incorporates both Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) data, enabling the model to learn bidirectional correspondences between text and speech. We use xxh for ASR and xxh for TTS \todo{Add statistical}
% \end{itemize}

% We represent both text and speech as discrete tokens across all data types, preserving the LLM's autoregressive prediction objective while integrating speech capabilities. The total loss combines losses from all four data types:
% \begin{equation}
% \mathcal{L}_{\text{total}} = \lambda_1\mathcal{L}_{\text{text}} + \lambda_2\mathcal{L}_{\text{speech}} + \lambda_3\mathcal{L}_{\text{interleaved}} + \lambda_4\mathcal{L}_{\text{supervised}}
% \end{equation}
% Individual losses are calculated as:
% \begin{align}
% \mathcal{L}_{\text{unsupervised}}(D) &= -\sum_{j=1}^{m} \sum_{i=1}^{n_j} \log P(d_{i,j}|d_{<i,j}; \theta) \\
% \mathcal{L}_{\text{supervised}}(S) &= -\sum_{j=1}^{m_s} \sum_{i=p_j+1}^{y_j} \log P(s_{i,j}|s_{\leq p_j,j}; \theta)
% \end{align}
% Where $D$ represents unsupervised datasets (text, speech, or interleaved), $S$ is the supervised dataset, $m$ and $m_s$ are sample numbers, $n_j$ is token count per sample, $d_{i,j}$ and $s_{i,j}$ are tokens, $\theta$ represents model parameters, and for supervised loss, $p_j$ and $y_j$ are prefix and total token counts respectively.

% The proportions of the four types of data were set as follows: To maintain the model's strong language understanding capabilities, text data consistently occupied 30\% of each batch. Unsupervised speech and supervised speech-text data were each trained for exactly one epoch, regardless of total training tokens, to guarantee complete coverage of these specialized datasets. The remaining capacity was filled with interleaved data. This approach balanced the preservation of text comprehension with comprehensive exposure to speech-specific data, while promoting cross-modal integration through the interleaved samples.

\subsubsection{Speech-Text Pre-training}
\label{subsubsec:pretraining}

To extend LLMs' capabilities in speech-text tasks, we introduce a speech-text pre-training stage. This stage enables the model to process and represent discrete speech tokens. We utilize four data types, each serving a specific purpose:

\begin{itemize}[leftmargin=*,itemsep=0pt,parsep=0.2em,topsep=0.0em,partopsep=0.0em]
\item \textbf{Interleaved speech-text data:}
%To align text and speech, we synthesize speech from FineWeb-Edu~\citep{fineweb} and Chinese-Fineweb-Edu~\citep{chinese-fineweb-edu}. We selected both English and Chinese datasets to apply the previously mentioned synthesis process, generating a total of 600B tokens, with a 2:1 ratio of English to Chinese.
As described in~\Cref{sec:interleaved-data}, these datasets facilitate cross-modal knowledge transfer between text and speech. 
\item \textbf{Unsupervised text data:} We use a diverse corpus, similar to \citet{chatglm}, containing 10T tokens from webpages, Wikipedia, books, code, and research papers to maintain the model's language understanding.
\item \textbf{Unsupervised speech data:} Using the Emilia pipeline \citep{emilia}, we collected 700k hours of high-quality English and Chinese speech data, filtered by DNSMOS P.835 scores above 2.75, ensuring diverse and clean speech inputs.
\item \textbf{Supervised speech-text data:} This includes ASR and TTS data, teaching the model to learn bidirectional relationships between speech and text.
\end{itemize}

Both text and speech are represented as discrete tokens. The model is trained with the next-token prediction objective using the cross-entropy loss. For text, speech, and interleaved data, the model is trained to predict all tokens. For supervised speech-text data, the model is trained to predict only the tokens in the target parts (text in ASR data and speech in TTS data). We set text data at 30\% of each batch to preserve language ability. Unsupervised speech and supervised speech-text data are trained for one epoch each, while interleaved data fills the remaining capacity, balancing language comprehension and speech processing. The detailed training data distribution can be found in~\Cref{app:training-datasets}.
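The two cases can be written as a single masked objective. As a sketch, for a token sequence $x_1, \ldots, x_T$ with a binary mask $m_t$ (both are notation introduced here),
\begin{equation*}
\mathcal{L}(x) = -\sum_{t=1}^{T} m_t \log P(x_t \mid x_{<t}; \theta),
\end{equation*}
where $m_t = 1$ for every position in text, speech, and interleaved samples, while for supervised speech-text samples $m_t = 1$ only on target tokens (the transcript in ASR data, the speech tokens in TTS data) and $m_t = 0$ on the source part.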

\subsubsection{Supervised Fine-tuning}
Following speech-text pre-training, we fine-tune our model for speech dialogue tasks using a dataset derived from Magpie~\citep{magpie}. 
We use GPT-4 to adapt the original text-based dialogues for speech scenarios by filtering examples, shortening responses, and avoiding outputting text that cannot be read aloud.
The detailed prompt for this adaptation process can be found in~\Cref{app:speech-dialogue-rewrite-prompt-template}. This curation process results in our SpeechDialog-90K dataset, which contains 90K triplets (SpeechI, TextR, SpeechR), where SpeechI is the speech instruction, TextR is the text response, and SpeechR is the corresponding speech response synthesized from TextR using MeloTTS~\citep{melotts}. For training hyperparameters, we use a batch size of 64, a sequence length of 4096 tokens, and train for 10 epochs with a learning rate decaying from 5e-5 to 5e-6, using the AdamW optimizer.

\subsection{Inference}

During inference, our framework supports two modes: speech-to-speech and text-guided speech generation.
In speech-to-speech mode, the model directly processes speech input to generate speech output. Our streaming tokenizer converts the user's speech into discrete tokens, which the model then processes to generate output speech tokens. These output tokens are synthesized into continuous speech by our block-wise decoder, operating on 2-second blocks (25 tokens at 12.5Hz). The block-wise encoder computes representations which are then used by the conditional flow matching model to generate Mel spectrograms. Finally, these spectrograms are converted into continuous speech using the HiFi-GAN vocoder.
For text-guided speech generation, given the speech input (SpeechI), the model generates both a text response (TextR) and the corresponding speech response (SpeechR) in a single forward pass. The text response is generated as an intermediate step, which then guides the production of the final speech output. The prompt templates for the two modes can be found in \Cref{app:dialogue-prompt-template}.
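The difference between the two modes can be summarized by how the output distribution factorizes. As a sketch, assuming that in text-guided mode the text response precedes the speech response in the generated sequence, the autoregressive model implies
\begin{equation*}
P(\mathrm{TextR}, \mathrm{SpeechR} \mid \mathrm{SpeechI}) = P(\mathrm{TextR} \mid \mathrm{SpeechI}) \, P(\mathrm{SpeechR} \mid \mathrm{SpeechI}, \mathrm{TextR}),
\end{equation*}
whereas speech-to-speech mode models $P(\mathrm{SpeechR} \mid \mathrm{SpeechI})$ directly, without the intermediate text.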
