Keywords: large language models, speech interaction, speech-language models
Abstract: Recent advances in Speech Language Models (SpeechLMs) have enabled end-to-end spoken dialogue systems. However, most existing SpeechLMs rely on a single-stream or sequential modeling paradigm, in which semantic reasoning and acoustic generation are either flattened into one stream or processed in a fixed Thinker–Talker order. Both paradigms constrain the interaction between semantic reasoning and acoustic generation, limiting the naturalness and expressiveness of spoken dialogue. To address this, we propose BrainSpeech, a novel SpeechLM framework that models semantic reasoning and acoustic generation as two parallel yet interacting streams within a unified large language model. To enable effective cross-stream interaction, we introduce a dedicated attention mechanism that explicitly exchanges semantic and acoustic information during generation. Building on this dual-stream formulation, we further develop a three-stage training strategy that preserves the reasoning capability of the underlying language model while reducing reliance on large-scale spoken dialogue corpora. In addition, we design a streaming decoding strategy that supports real-time generation of continuous, high-fidelity speech. Experiments on multiple spoken dialogue benchmarks demonstrate that BrainSpeech produces more natural and expressive speech and achieves superior speech-to-speech performance compared with larger open-source SpeechLMs. Audio samples are available at https://brainspeech.github.io/.
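To make the dual-stream idea in the abstract concrete, below is a minimal PyTorch sketch of cross-stream attention between a semantic stream and an acoustic stream. The layer names, dimensions, residual/normalization placement, and fusion scheme are illustrative assumptions, not BrainSpeech's actual architecture; it only shows the general pattern of two parallel streams exchanging information via attention at each layer.

```python
# Illustrative sketch only: a hypothetical cross-stream block in which a
# semantic stream and an acoustic stream each run self-attention, then
# attend to one another. All module names and hyperparameters are assumed.
import torch
import torch.nn as nn


class CrossStreamBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Each stream attends to itself (self-attention) ...
        self.sem_self = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.aco_self = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # ... and to the other stream (cross-attention), so semantic reasoning
        # and acoustic generation can interact during decoding.
        self.aco_to_sem = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sem_to_aco = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sem_norm = nn.LayerNorm(d_model)
        self.aco_norm = nn.LayerNorm(d_model)

    def forward(self, sem: torch.Tensor, aco: torch.Tensor):
        # sem, aco: (batch, seq_len, d_model) hidden states of the two streams.
        sem = sem + self.sem_self(sem, sem, sem, need_weights=False)[0]
        aco = aco + self.aco_self(aco, aco, aco, need_weights=False)[0]
        # Exchange information: semantic queries attend over acoustic
        # keys/values, and acoustic queries attend over semantic keys/values.
        sem = sem + self.aco_to_sem(sem, aco, aco, need_weights=False)[0]
        aco = aco + self.sem_to_aco(aco, sem, sem, need_weights=False)[0]
        return self.sem_norm(sem), self.aco_norm(aco)


if __name__ == "__main__":
    block = CrossStreamBlock()
    sem = torch.randn(2, 16, 512)  # semantic (text-token) stream states
    aco = torch.randn(2, 16, 512)  # acoustic (speech-token) stream states
    sem_out, aco_out = block(sem, aco)
    print(sem_out.shape, aco_out.shape)  # torch.Size([2, 16, 512]) each
```

In a real decoder both attention directions would also be causally masked so that each stream only sees past positions of the other; that detail is omitted here for brevity.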
Paper Type: Long
Research Area: Generalizability and Transfer
Research Area Keywords: spoken dialogue systems, spoken language understanding, spoken dialogue
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 3249