Abstract: We describe and evaluate a hybrid neural audio coding system consisting of a perceptual audio encoder and a generative model, MDCTNet. By applying recurrent layers (RNNs), we capture correlations in both the time and frequency directions in a perceptually weighted, adaptive modified discrete cosine transform (MDCT) domain. By training MDCTNet on a diverse set of full-range monophonic audio signals at a 48 kHz sampling rate, we achieve performance competitive with state-of-the-art audio coding at 24 kb/s variable bitrate (VBR). We also quantify the effect of generative-model-based decoding at lower and higher bitrates and discuss some caveats of using data-driven signal reconstruction for the audio coding task.
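The core architectural idea in the abstract, recurrent layers running along both the time and frequency axes of an MDCT coefficient grid, can be illustrated with a minimal sketch, assuming PyTorch. All class names, layer sizes, and the pooling/projection choices below are illustrative assumptions, not the paper's actual MDCTNet architecture.

```python
import torch
import torch.nn as nn

class TimeFreqRNN(nn.Module):
    """Toy two-direction recurrence over an MDCT coefficient grid.

    Input: (batch, frames, bins) of perceptually weighted MDCT coefficients.
    One GRU steps along the time axis (one step per frame) and another
    along the frequency axis (one step per bin), loosely mirroring the
    idea of capturing correlations in both directions. Hypothetical
    sketch; not the authors' model.
    """

    def __init__(self, bins: int, hidden: int = 128):
        super().__init__()
        self.time_rnn = nn.GRU(bins, hidden, batch_first=True)
        self.freq_rnn = nn.GRU(1, hidden, batch_first=True)
        self.proj = nn.Linear(2 * hidden, bins)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, f = x.shape
        # Recurrence over time: each frame's bin vector is one input step.
        ht, _ = self.time_rnn(x)                       # (b, t, hidden)
        # Recurrence over frequency: each bin is one step, per frame.
        xf = x.reshape(b * t, f, 1)
        hf, _ = self.freq_rnn(xf)                      # (b*t, f, hidden)
        hf = hf.mean(dim=1).reshape(b, t, -1)          # pool over bins
        # Combine both directions and predict per-frame coefficients.
        return self.proj(torch.cat([ht, hf], dim=-1))  # (b, t, bins)

coeffs = torch.randn(2, 50, 256)  # 2 clips, 50 frames, 256 MDCT bins
print(TimeFreqRNN(bins=256)(coeffs).shape)  # torch.Size([2, 50, 256])
```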