\vspace{-1em}
\section{Related Work}
\textbf{Speech Tokenization}
Speech tokenizers, which transform an audio clip into discrete tokens, fall into two categories. Neural acoustic codecs~\citep{soundstream,neuralac,audio_rvqgan,wavtokenizer} aim to reconstruct high-quality audio at low bitrates. Semantic tokens~\citep{hubert,w2vbert} are extracted from speech representations learned with self-supervised learning on speech data. SpeechTokenizer~\citep{speechtokenizer} unifies semantic and acoustic tokens as different residual vector quantization (RVQ) layers, but it also suffers from an increase in sequence length. CosyVoice~\citep{cosyvoice} proposes a supervised semantic tokenizer derived from a speech recognition model and successfully applies it to text-to-speech synthesis. However, the application of this tokenizer to speech language modeling remains unexplored.

\textbf{Speech-Text Pre-training}
GSLM~\citep{GSLM} proposes the generative spoken language modeling task, which trains a model with the next-token-prediction objective on discrete semantic tokens obtained through self-supervised learning. AudioLM~\citep{AudioLM} proposes a hybrid tokenization scheme that combines semantic tokens with acoustic tokens from a neural audio codec~\citep{soundstream}. TWIST~\citep{TWIST} trains the speech language model with a warm start from a pretrained text language model, specifically OPT~\citep{OPT}. Moshi~\citep{moshi} scales up the amount of natural speech data in TWIST to 7 million hours. Spirit-LM~\citep{spiritlm} further extends TWIST by adding speech-text interleaved data curated from speech-text parallel corpora. However, the scarcity of parallel corpora restricts the scale of the interleaved data.

\textbf{End-to-End Spoken Chatbots} Early work on speech-to-speech models mainly focused on processing tasks such as speech translation~\citep{speechnet,speecht5}. Since the success of ChatGPT as a text-based chatbot, many works have explored methods for developing speech-based chatbots that can understand and respond in speech. SpeechGPT~\citep{speechgpt} proposes combining existing large language models (LLMs) with discrete speech representations to obtain speech conversational abilities.
% Qwen-audio~\citep{qwen-audio} expands the range of audio types and tasks in audio-to-text instruction tuning.
Moshi~\citep{moshi} proposes a full-duplex spoken dialogue framework built on its pretrained speech language model. LLaMA-Omni~\citep{llama-omni} and Mini-Omni~\citep{mini-omni} both propose lightweight alignment methods that transform an open language model into a spoken chatbot.
