Wavescale Neural Audio Codec: Bidirectional Multiscale Residual Quantization for High-Fidelity Audio Compression

ICLR 2026 Conference Submission 16685 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: Neural Audio Compression, Variational AutoEncoder, Multiscale Residual Vector Quantization, Codebook, Wavescale, Waveloss

TL;DR: A multiscale RVQ-VAE based neural audio codec that improves compression and reconstruction of general sounds by introducing a downscale-upscale quantization framework with stage-wise alignment loss.

Abstract: Modern AI systems need audio representations that are both bandwidth-efficient and model-friendly. Neural codecs learn discrete token streams optimized for perceptual and task objectives, unifying compression with generation, editing, retrieval, and multimodal reasoning. Neural compression with residual vector quantization (RVQ) achieves low bitrates at high quality by encoding audio as discrete latents. Recent multiscale RVQ variants (e.g., SAT, SNAC) distribute quantization across multiple temporal scales to reduce token rate and computational cost. However, a purely upscale hierarchy assigns coarse (low-rate, slowly varying) structure, typically low-frequency components, to early stages and fine (high-rate, rapidly varying) detail, typically high-frequency components, to later stages. This works well for speech but often fails for music and environmental audio: in music, early stages can carry fine detail, whereas in environmental audio, periodicity is weak. We introduce the Wavescale Neural Audio Codec (WNAC), which replaces the pure upscale flow with a downscale-then-upscale path. By inserting fine-to-coarse stages before the coarse-to-fine stages, WNAC preserves early low-frequency information. We also add a scale-aware waveloss that aligns quantized outputs at the same temporal resolution across stages, improving reconstruction sharpness and stability. Experiments show higher accuracy and efficiency across speech, music, environmental audio, and a mixed general set, outperforming single-scale DAC while keeping the speed benefits of multiscale RVQ.
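To make the downscale-then-upscale idea concrete, here is a minimal NumPy sketch of multiscale residual quantization driven by a per-stage stride schedule. It is an illustration under stated assumptions, not the authors' implementation: the function names, the average-pool/repeat resampling, the random codebooks, and the example schedule `[1, 2, 4, 2, 1]` are all hypothetical choices that the abstract does not specify. The waveloss is not sketched.

```python
import numpy as np

def quantize(x, codebook):
    """Nearest-codeword lookup. x: (T, D), codebook: (K, D)."""
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]

def multiscale_rvq(z, codebooks, strides):
    """Residual quantization where stage i operates at temporal stride strides[i].

    Assumption: downsampling is average pooling and upsampling is
    nearest-neighbour repetition; learned resampling layers would differ.
    """
    T, D = z.shape
    residual = z.copy()
    recon = np.zeros_like(z)
    tokens = []
    for codebook, s in zip(codebooks, strides):
        # Pool the current residual to T // s frames (T must be divisible by s).
        pooled = residual.reshape(T // s, s, D).mean(axis=1)
        idx, q = quantize(pooled, codebook)
        # Bring the quantized frames back to full resolution.
        q_up = np.repeat(q, s, axis=0)
        residual = residual - q_up
        recon = recon + q_up
        tokens.append(idx)
    return tokens, recon, residual

rng = np.random.default_rng(0)
T, D, K = 16, 4, 32
z = rng.normal(size=(T, D))

# Hypothetical schedule: fine-to-coarse stages first, then coarse-to-fine.
down_then_up = [1, 2, 4, 2, 1]
codebooks = [rng.normal(size=(K, D)) for _ in down_then_up]

tokens, recon, residual = multiscale_rvq(z, codebooks, down_then_up)
tokens_per_stage = [len(t) for t in tokens]  # [16, 8, 4, 8, 16]
```

In this toy setup a purely upscale schedule would be `[4, 2, 1]`; the only change for the downscale-then-upscale path is prepending the finer strides, so the first stage sees the full-resolution latent before any pooling.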

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 16685
