CSAVocoder: A Causal Spatial Audio Vocoder Towards Real-Time Spatial Audio Generation

ACL ARR 2026 January Submission5915 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Spatial Audio, Neural Vocoder, Causal Streaming Voice Synthesis
Abstract: Spatial audio vocoders aim to convert mel-spectrograms produced by generative models into spatial audio waveforms. Most existing vocoder research focuses on monaural audio, and direct extensions to spatial audio often degrade spatial quality by ignoring inter-channel cues. We present CSAVocoder, a causal GAN-based Spatial Audio Vocoder that jointly optimizes waveform fidelity and spatial rendering. Our framework introduces a spatial adapter that fuses multi-channel mel-spectrograms with dynamic source-listener pose information, and a spatial consistency discriminator that explicitly supervises inter-channel spatial cues such as interaural level and phase differences. To meet real-time requirements, we design a strictly causal, stateful generator that supports efficient streaming inference with constant memory overhead. The architecture supports different spatial audio formats without format-specific modifications. Experiments on large-scale spatial audio datasets demonstrate that CSAVocoder ensures audio quality and spatial fidelity while maintaining real-time performance. Our demo page is at: \url{https://csavocoder.github.io}.
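For readers unfamiliar with the inter-channel cues the abstract refers to, the sketch below illustrates how interaural level difference (ILD) and interaural phase difference (IPD) can be computed from a stereo pair in the STFT domain. This is a minimal, hypothetical illustration of these cues in general, not the paper's actual discriminator features; all function names and parameters are assumptions.

```python
import numpy as np

def interaural_cues(left, right, n_fft=512, hop=128, eps=1e-8):
    """Per-bin interaural level difference (ILD, dB) and interaural
    phase difference (IPD, radians) between two channels.
    Illustrative only; the paper's discriminator may use other features."""
    win = np.hanning(n_fft)

    def stft(x):
        # Simple framed rFFT: shape (num_frames, num_bins)
        frames = [np.fft.rfft(win * x[s:s + n_fft])
                  for s in range(0, len(x) - n_fft + 1, hop)]
        return np.array(frames)

    L, R = stft(left), stft(right)
    # ILD: magnitude ratio in dB (eps avoids log of zero)
    ild = 20 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    # IPD: phase of the cross-spectrum, wrapped to (-pi, pi]
    ipd = np.angle(L * np.conj(R))
    return ild, ipd

# Toy example: a 440 Hz tone that is louder in the left channel
# and slightly delayed in the right channel.
sr = 16000
t = np.arange(sr) / sr
src = np.sin(2 * np.pi * 440 * t)
left = 1.0 * src
right = 0.5 * np.roll(src, 8)   # half amplitude, 8-sample delay
ild, ipd = interaural_cues(left, right)
# Left channel is louder, so the mean ILD is positive (about +6 dB here).
```

A spatial consistency discriminator in the spirit of the abstract would compare such cue maps between generated and reference multi-channel audio, penalizing mismatches that a per-channel waveform loss alone would miss.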
Paper Type: Long
Research Area: Speech Processing and Spoken Language Understanding
Research Area Keywords: speech technologies
Contribution Types: NLP engineering experiment
Languages Studied: English, Chinese
Submission Number: 5915