Multi-band Frequency Reconstruction for Neural Psychoacoustic Coding

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC 4.0
Abstract: Achieving high-fidelity audio compression while preserving perceptual quality across diverse audio types remains a significant challenge in Neural Audio Coding (NAC). This paper introduces MUFFIN, a fully convolutional NAC framework that leverages psychoacoustically guided multi-band frequency reconstruction. Central to MUFFIN is the Multi-Band Spectral Residual Vector Quantization (MBS-RVQ) mechanism, which quantizes the latent speech representation across distinct frequency bands. Guided by psychoacoustic studies, this approach optimizes bitrate allocation and enhances fidelity, achieving efficient compression while disentangling content from speaker attributes through distinct codebooks. MUFFIN integrates a transformer-inspired convolutional architecture with a proposed modified snake activation function to capture fine frequency details with greater precision. Extensive evaluations on diverse datasets (LibriTTS, IEMOCAP, GTZAN, BBC) demonstrate that MUFFIN consistently surpasses existing codecs in audio reconstruction across various domains. Notably, a high-compression variant achieves a state-of-the-art 12.5 Hz token rate while preserving reconstruction quality. Furthermore, MUFFIN excels in downstream generative tasks, demonstrating its potential as a robust token representation for integration with large language models. These results establish MUFFIN as a significant advancement in NAC and as the first neural psychoacoustic coding system. Speech demos and code are available at https://demos46.github.io/muffin/ and https://github.com/dianwen-ng/MUFFIN.
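For readers unfamiliar with the snake activation that MUFFIN modifies, below is a minimal PyTorch sketch of the standard periodic form, snake_α(x) = x + (1/α)·sin²(αx) (Ziyin et al., 2020), with a learnable per-channel frequency. The paper's specific modification is not reproduced here; the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Standard snake activation: x + (1/alpha) * sin^2(alpha * x),
    with a learnable frequency parameter per channel. MUFFIN proposes a
    modified variant; this sketch shows only the baseline form it builds on."""

    def __init__(self, channels: int, alpha_init: float = 1.0, eps: float = 1e-9):
        super().__init__()
        # One frequency per channel, broadcast over (batch, channels, time).
        self.alpha = nn.Parameter(alpha_init * torch.ones(1, channels, 1))
        self.eps = eps  # guards against division by a near-zero alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + torch.sin(self.alpha * x) ** 2 / (self.alpha + self.eps)

# Usage on a (batch, channels, time) feature map:
act = Snake(channels=64)
y = act(torch.randn(2, 64, 16000))
```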
Lay Summary: Despite recent progress in neural audio coding (NAC), most systems still struggle to maintain perceptual quality at high compression rates, especially across diverse audio types. Moreover, they largely ignore the psychoacoustic principles that underpin how humans perceive sound. We introduce MUFFIN, the first Neural Psychoacoustic Codec (NPC), which leverages a novel multi-band spectral quantization strategy aligned with human auditory perception. By separating and encoding low-, mid-, and high-frequency bands differently (see the sketch below), MUFFIN preserves critical features such as speech intelligibility, content articulation, and speaker identity. Our model also uses an enhanced snake activation to capture fine spectral detail. MUFFIN achieves state-of-the-art audio quality at extreme compression rates, enabling high-fidelity audio at a token rate of just 12.5 Hz. It improves zero-shot text-to-speech synthesis and offers disentangled token representations, making it well suited for integration with large language models. This positions MUFFIN as a key building block for bandwidth-efficient, perceptually rich, and controllable audio applications, from streaming to assistive AI.
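To make the band-wise quantization idea concrete, here is a hypothetical PyTorch sketch that splits a latent representation into frequency bands and quantizes each band with its own residual vector quantizer, so bitrate can be allocated per band. The band sizes, codebook size, and stage counts are illustrative assumptions, not the paper's actual MBS-RVQ configuration, and training machinery (straight-through gradients, commitment losses) is omitted.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Plain residual VQ: each stage quantizes the residual left by the
    previous stage via nearest-neighbour codebook lookup (inference only;
    training would need a straight-through estimator and commitment loss)."""

    def __init__(self, dim: int, codebook_size: int, num_stages: int):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_stages)
        )

    def forward(self, z: torch.Tensor):  # z: (batch, time, dim)
        residual, quantized, codes = z, torch.zeros_like(z), []
        for cb in self.codebooks:
            dist = torch.cdist(residual, cb.weight)  # (B, T, codebook_size)
            idx = dist.argmin(dim=-1)                # nearest code per frame
            q = cb(idx)
            quantized = quantized + q
            residual = residual - q
            codes.append(idx)
        return quantized, codes

class MultiBandRVQ(nn.Module):
    """Hypothetical band-wise quantizer: the latent channels are assumed to
    be ordered low -> high frequency and each band gets its own ResidualVQ."""

    def __init__(self, dim=256, band_sizes=(32, 64, 160),
                 codebook_size=1024, stages=2):
        super().__init__()
        assert sum(band_sizes) == dim
        self.band_sizes = band_sizes
        self.quantizers = nn.ModuleList(
            ResidualVQ(b, codebook_size, stages) for b in band_sizes
        )

    def forward(self, z: torch.Tensor):  # z: (batch, time, dim)
        outs, all_codes = [], []
        for band, vq in zip(z.split(self.band_sizes, dim=-1), self.quantizers):
            q, codes = vq(band)
            outs.append(q)
            all_codes.append(codes)
        return torch.cat(outs, dim=-1), all_codes

# Usage: quantize a latent sequence and recover per-band code indices.
mbq = MultiBandRVQ()
z_q, codes = mbq(torch.randn(2, 100, 256))
```

Giving each band its own codebooks is what allows uneven bitrate allocation: perceptually critical bands can receive more stages or larger codebooks than less critical ones.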
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/dianwen-ng/MUFFIN
Primary Area: Deep Learning->Other Representation Learning
Keywords: Neural Audio Codec, Neural Psychoacoustic Codec, Zero-shot TTS, Vector Quantization
Submission Number: 2777