Keywords: large language models, speech interaction, speech-language models
Abstract: Recent advances in Speech Language Models (SpeechLMs) have enabled end-to-end spoken dialogue systems. However, most existing SpeechLMs rely on a single-stream or sequential modeling paradigm, in which semantic reasoning and acoustic generation are either flattened into one stream or processed in a fixed Thinker–Talker order. Both paradigms constrain the interaction between semantic reasoning and acoustic generation, limiting the naturalness and expressiveness of spoken dialogue. To address this, we propose BrainSpeech, a novel SpeechLM framework that models semantic reasoning and acoustic generation as two parallel yet interacting streams within a unified large language model. To enable effective cross-stream interaction, we introduce a dedicated attention mechanism that explicitly exchanges semantic and acoustic information during generation. Building on this dual-stream formulation, we further develop a three-stage training strategy that preserves the reasoning capability of the underlying language model while reducing reliance on large-scale spoken dialogue corpora. In addition, we design a streaming decoding strategy that supports real-time generation of continuous, high-fidelity speech. Experiments on multiple spoken dialogue benchmarks demonstrate that BrainSpeech produces more natural and expressive speech and achieves superior speech-to-speech performance compared with larger open-source SpeechLMs. Audio samples are available at https://brainspeech.github.io/.
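To make the dual-stream idea in the abstract concrete, below is a minimal PyTorch sketch of cross-stream attention between a semantic stream and an acoustic stream. The layer names, dimensions, residual/normalization placement, and fusion scheme are illustrative assumptions, not BrainSpeech's actual architecture; it only shows the general pattern of two parallel streams exchanging information via attention at each layer.

```python
# Illustrative sketch only: a hypothetical cross-stream block in which a
# semantic stream and an acoustic stream each run self-attention, then
# attend to one another. All module names and hyperparameters are assumed.
import torch
import torch.nn as nn


class CrossStreamBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Each stream attends to itself (self-attention) ...
        self.sem_self = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.aco_self = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # ... and to the other stream (cross-attention), so semantic reasoning
        # and acoustic generation can interact during decoding.
        self.aco_to_sem = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sem_to_aco = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sem_norm = nn.LayerNorm(d_model)
        self.aco_norm = nn.LayerNorm(d_model)

    def forward(self, sem: torch.Tensor, aco: torch.Tensor):
        # sem, aco: (batch, seq_len, d_model) hidden states of the two streams.
        sem = sem + self.sem_self(sem, sem, sem, need_weights=False)[0]
        aco = aco + self.aco_self(aco, aco, aco, need_weights=False)[0]
        # Exchange information: semantic queries attend over acoustic
        # keys/values, and acoustic queries attend over semantic keys/values.
        sem = sem + self.aco_to_sem(sem, aco, aco, need_weights=False)[0]
        aco = aco + self.sem_to_aco(aco, sem, sem, need_weights=False)[0]
        return self.sem_norm(sem), self.aco_norm(aco)


if __name__ == "__main__":
    block = CrossStreamBlock()
    sem = torch.randn(2, 16, 512)  # semantic (text-token) stream states
    aco = torch.randn(2, 16, 512)  # acoustic (speech-token) stream states
    sem_out, aco_out = block(sem, aco)
    print(sem_out.shape, aco_out.shape)  # torch.Size([2, 16, 512]) each
```

In a real decoder both attention directions would also be causally masked so that each stream only sees past positions of the other; that detail is omitted here for brevity.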
Paper Type: Long
Research Area: Generalizability and Transfer
Research Area Keywords: spoken dialogue systems, spoken language understanding, spoken dialogue
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 3249