Learnable Wavelet-Enhanced Bidirectional Autoencoders: A Unified Framework for Multi-Resolution Speech Enhancement
Keywords: speech enhancement, learnable wavelet transform, bidirectional autoencoder, multi-resolution analysis, deep learning for audio, conjugate quadrature filters
TL;DR: WEBA is a bidirectional autoencoder with learnable wavelet transforms, adaptive thresholding, and a sparsity-aware loss that delivers SOTA speech enhancement on VoiceBank-DEMAND and CHiME-4 while using roughly half the parameters of competing models.
Abstract: Effective representation, reconstruction, and denoising of speech signals are critical challenges in speech signal processing, where signals often exhibit complex, multi-resolution structures. Traditional autoencoders address these tasks using separate networks for encoding and decoding, which increases memory usage and computational overhead. This paper introduces the Learnable Wavelet-Enhanced Bidirectional Autoencoder (WEBA), a framework tailored for efficient signal reconstruction and denoising. WEBA employs a bidirectional architecture that reuses the same network for both encoding and decoding, significantly reducing resource requirements. The framework incorporates adaptive wavelet-based representations through Learnable Fast Discrete Wavelet Transforms, providing multi-resolution analysis suited to complex signal structures. Additionally, it leverages Conjugate Quadrature Filters for orthogonal signal decomposition, a Learnable Asymmetric Hard Thresholding function for noise suppression, and a Sparsity-Enforcing Loss Function. By unifying these components into an end-to-end trainable framework, WEBA demonstrates superior performance in signal reconstruction and denoising, surpassing state-of-the-art methods on Valentini’s VoiceBank-DEMAND and CHiME-4 datasets in five key metrics while improving computational efficiency. The source code for this paper is available in this anonymous repository: https://anonymous.4open.science/r/LWE-BAE-A-Unified-Framework-for-Multi-Resolution-Signal-Processing-0DE4.
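To make the components named in the abstract more concrete, the following is a minimal NumPy sketch of one wavelet analysis level built from a conjugate quadrature filter pair, followed by an asymmetric hard threshold. It is an illustration under stated assumptions, not the authors' implementation: the function names (`cqf_highpass`, `dwt_level`, `asymmetric_hard_threshold`) are hypothetical, the low-pass filter is fixed to the standard Daubechies-2 coefficients rather than learned, boundaries are handled periodically, and the thresholds are plain constants rather than trainable parameters.

```python
import numpy as np

def cqf_highpass(h):
    """Derive the high-pass filter from low-pass h via the CQF relation
    g[n] = (-1)^n * h[L-1-n]."""
    h = np.asarray(h, dtype=float)
    signs = (-1.0) ** np.arange(len(h))
    return signs * h[::-1]

def dwt_level(x, h):
    """One analysis level with periodic extension (assumed here):
    correlate with h and g, keeping every second output sample."""
    x = np.asarray(x, dtype=float)
    g = cqf_highpass(h)
    n = len(x)
    # Row k holds the samples x[2k], x[2k+1], ..., wrapped around modulo n.
    idx = (np.arange(0, n, 2)[:, None] + np.arange(len(h))[None, :]) % n
    approx = x[idx] @ h   # low-pass (approximation) coefficients
    detail = x[idx] @ g   # high-pass (detail) coefficients
    return approx, detail

def asymmetric_hard_threshold(c, t_pos, t_neg):
    """Zero coefficients strictly inside (-t_neg, t_pos); keep the rest unchanged.
    In a learnable variant, t_pos and t_neg would be trainable."""
    c = np.asarray(c, dtype=float)
    keep = (c >= t_pos) | (c <= -t_neg)
    return np.where(keep, c, 0.0)

# Daubechies-2 low-pass filter (orthonormal), used as a fixed stand-in
# for a learned filter.
s3 = np.sqrt(3.0)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))
g = cqf_highpass(h)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
approx, detail = dwt_level(x, h)
denoised_detail = asymmetric_hard_threshold(detail, t_pos=0.5, t_neg=0.3)
```

Because the high-pass filter is derived from the low-pass one, only a single filter needs to be parameterized, and the orthogonality of the pair means the decomposition preserves signal energy. A sparsity-enforcing loss could then, for example, penalize the L1 norm of the detail coefficients, though the exact form used in the paper is not given in the abstract.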
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 21345