Keywords: Cross-Modal, Streaming Codec, Single-Codebook, Low-bitrate, Contrastive Learning
Abstract: Discrete speech representations are critical for modern generative speech tasks and cross-modal modeling. However, current neural codecs often produce tokens that are either semantically redundant—entangled with paralinguistic variations like timbre—or structured in complex multi-codebook hierarchies that increase the complexity of downstream modeling. To bridge this gap, we propose \textbf{SecoustiCodec}, a streaming speech codec designed to extract \textit{disentangled}, \textit{single-codebook} discrete representations via cross-modal alignment. Unlike prior works relying on distillation from acoustic models, we introduce a frame-level text-speech contrastive learning framework that strictly aligns acoustic frames with linguistic units, effectively purging paralinguistic variance from the semantic codebook. To maintain high-fidelity reconstruction without compromising semantic purity, we explicitly model global paralinguistic attributes to complement the semantic tokens, allowing the decoder to synthesize fine-grained acoustic details from disentangled representations. Furthermore, we propose a semantic-only quantization mechanism combining Variational Autoencoders (VAE) and Finite Scalar Quantization (FSQ) to maximize codebook utilization and mitigate the long-tail distribution issue. SecoustiCodec supports low-latency streaming and achieves state-of-the-art reconstruction quality (PESQ 1.77/2.58 at 0.27/1 kbps). Audio samples are available at \url{https://anonymous.4open.science/w/SecoustiCodec_Page-86F2}. Code and models are provided at \url{https://anonymous.4open.science/r/SecoustiCodec-BE3E}.
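The abstract's semantic-only quantization combines a VAE bottleneck with Finite Scalar Quantization (FSQ). As a rough illustration of the FSQ step only (not the authors' implementation; the level counts and NumPy formulation are assumptions for the sketch), each latent dimension is bounded and rounded to a small fixed number of levels, so the implicit codebook is the product of per-dimension levels and every code is reachable:

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite Scalar Quantization sketch.

    z: array of shape (..., d), continuous latent.
    levels: length-d list, number of discrete levels per dimension.
    Returns integer codes in [-(L-1)/2, (L-1)/2] per dimension.
    """
    half = (np.asarray(levels, dtype=float) - 1.0) / 2.0
    # Bound each dimension smoothly, then snap to the nearest level.
    z_bounded = np.tanh(z) * half
    return np.round(z_bounded)

# Toy latent with 3 dimensions, 5 levels each -> 5**3 = 125 possible codes.
z = np.array([0.3, -2.0, 5.0])
codes = fsq_quantize(z, [5, 5, 5])
```

In training, the rounding would typically be paired with a straight-through gradient estimator; that part is omitted here since the sketch is NumPy-only.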
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: Speech Recognition, Text-to-Speech, Spoken Language Understanding
Contribution Types: NLP engineering experiment
Languages Studied: English, Chinese
Submission Number: 5344