Keywords: neural audio codec, residual vector quantization
TL;DR: We introduce a high-fidelity neural audio codec for compressing 44.1 kHz music into discrete or continuous latent representations.
Abstract: Neural audio codecs have become increasingly important for audio compression and, more recently, for creating tokenized representations for generative downstream tasks. Consequently, codec performance plays a crucial role in many applications. In this work, we introduce DisCodec, a high-fidelity neural audio codec for compressing 44.1 kHz music into discrete or continuous latent representations. DisCodec combines ConvNeXt and attention layers, an affine re-parametrization of the code vectors, and an improved commitment loss for better alignment between codebooks and model embeddings. We compare DisCodec against state-of-the-art neural audio codecs and perform a comprehensive ablation of the proposed architecture. We make the DisCodec codebase and model checkpoints available at https://github.com/ETH-DISCO/discodec.
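For readers unfamiliar with the residual vector quantization named in the keywords, the following is a minimal NumPy sketch of the generic technique, not DisCodec's implementation. The function names (`rvq_encode`, `rvq_decode`, `commitment_mse`), the random codebooks, and all sizes are assumptions made for illustration; the loss shown is the plain squared-error commitment term, not the improved variant the abstract refers to, and the affine re-parametrization is not modeled here.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Quantize vector x (shape [d]) with a list of codebooks (each [K, d]).

    Each stage picks the code nearest to the current residual, then
    subtracts it so the next stage quantizes what is left over.
    """
    residual = x.astype(np.float64).copy()
    quantized = np.zeros_like(residual)
    indices = []
    for cb in codebooks:
        dists = np.sum((cb - residual) ** 2, axis=1)  # squared L2 to every code
        idx = int(np.argmin(dists))
        indices.append(idx)
        quantized += cb[idx]
        residual -= cb[idx]
    return indices, quantized


def rvq_decode(indices, codebooks):
    """Reconstruct the quantized vector by summing the selected codes."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))


def commitment_mse(x, quantized):
    """Plain squared-error term between an embedding and its quantization."""
    return float(np.mean((x - quantized) ** 2))


# Hypothetical setup: 4 quantizer stages, 16 codes each, 8-dim embeddings.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]
x = rng.normal(size=8)

indices, quantized = rvq_encode(x, codebooks)
reconstructed = rvq_decode(indices, codebooks)
loss = commitment_mse(x, quantized)
```

In this sketch the token sequence for one embedding is just `indices`, one integer per quantizer stage; a trained codec would learn the codebooks rather than sample them at random.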
Submission Number: 31