Keywords: Cross-Modal, Streaming Codec, Single-Codebook, Low-bitrate, Contrastive Learning
Abstract: Discrete speech representations are critical for modern generative speech tasks and cross-modal modeling. However, current neural codecs often produce tokens that are either semantically redundant—entangled with paralinguistic variations like timbre—or structured in complex multi-codebook hierarchies that increase the complexity of downstream modeling. To bridge this gap, we propose \textbf{SecoustiCodec}, a streaming speech codec designed to extract \textit{disentangled}, \textit{single-codebook} discrete representations via cross-modal alignment. Unlike prior works relying on distillation from acoustic models, we introduce a frame-level text-speech contrastive learning framework that strictly aligns acoustic frames with linguistic units, effectively purging paralinguistic variance from the semantic codebook. To maintain high-fidelity reconstruction without compromising semantic purity, we explicitly model global paralinguistic attributes to complement the semantic tokens, allowing the decoder to synthesize fine-grained acoustic details from disentangled representations. Furthermore, we propose a semantic-only quantization mechanism combining Variational Autoencoders (VAE) and Finite Scalar Quantization (FSQ) to maximize codebook utilization and mitigate the long-tail distribution issue. SecoustiCodec supports low-latency streaming and achieves state-of-the-art reconstruction quality (PESQ 1.77/2.58 at 0.27/1 kbps). Audio samples are available at \url{https://anonymous.4open.science/w/SecoustiCodec_Page-86F2}. Code and models are provided at \url{https://anonymous.4open.science/r/SecoustiCodec-BE3E}.
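The abstract's semantic-only quantization combines a VAE bottleneck with Finite Scalar Quantization (FSQ). As a rough illustration of the FSQ step only (not the authors' implementation; the level counts and NumPy formulation are assumptions for the sketch), each latent dimension is bounded and rounded to a small fixed number of levels, so the implicit codebook is the product of per-dimension levels and every code is reachable:

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite Scalar Quantization sketch.

    z: array of shape (..., d), continuous latent.
    levels: length-d list, number of discrete levels per dimension.
    Returns integer codes in [-(L-1)/2, (L-1)/2] per dimension.
    """
    half = (np.asarray(levels, dtype=float) - 1.0) / 2.0
    # Bound each dimension smoothly, then snap to the nearest level.
    z_bounded = np.tanh(z) * half
    return np.round(z_bounded)

# Toy latent with 3 dimensions, 5 levels each -> 5**3 = 125 possible codes.
z = np.array([0.3, -2.0, 5.0])
codes = fsq_quantize(z, [5, 5, 5])
```

In training, the rounding would typically be paired with a straight-through gradient estimator; that part is omitted here since the sketch is NumPy-only.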
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: Speech Recognition, Text-to-Speech, Spoken Language Understanding
Contribution Types: NLP engineering experiment
Languages Studied: English, Chinese
Submission Number: 5344