A Robust Acoustic Feature Extraction Approach Based on Stacked Denoising Autoencoder

Published: 01 Jan 2015, Last Modified: 24 Apr 2025, Venue: BigMM 2015, License: CC BY-SA 4.0
Abstract: Acoustic feature extraction (AFE) is considered one of the most challenging techniques for speech applications, since adverse environmental noise causes significant variation in the extracted acoustic features. In this paper, we propose a systematic AFE approach based on a stacked denoising autoencoder (SDAE) that aims to extract acoustic features automatically. The denoising autoencoder (DAE), which is trained to reconstruct a clean, "repaired" input from a corrupted version of it, serves as the basic building block of the SDAE. In addition, a training set containing both clean and noisy speech gives the SDAE a much stronger ability to extract robust features under different noise conditions. Using speaker classification with features extracted by the proposed approach as the evaluation task, extensive experiments were conducted on TIMIT and NIST SRE 2004. They show that an SDAE with 3 hidden layers (3L-SDAE) performs better than shallower networks. The results also show that the features extracted by the 3L-SDAE perform better than MFCC features when the SNR is lower than 6 dB, and that they are more robust as the SNR decreases. Moreover, for different types of noise at an SNR of 0 dB, the speaker classification accuracy with 3L-SDAE features is above roughly 84%, whereas the accuracy with MFCC features is below 77%.
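The DAE building block described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the additive noise mixing at a target SNR, the single sigmoid hidden layer, the linear decoder, the mean squared error loss, the random stand-in data, and the names `mix_at_snr`, `sigmoid`, and `train_dae` are all assumptions made here for illustration.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that `clean + noise` has the requested SNR in dB.

    Assumption: additive noise, with SNR defined as a ratio of mean powers.
    """
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_dae(clean, noisy, n_hidden=8, lr=0.1, epochs=200, seed=0):
    """Train one DAE layer that maps a noisy input to its clean target.

    Hypothetical helper: full-batch gradient descent on mean squared error.
    Returns the parameters and the per-epoch reconstruction loss.
    """
    rng = np.random.default_rng(seed)
    _, d = clean.shape
    W1 = rng.normal(0.0, 0.1, (d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        h = sigmoid(noisy @ W1 + b1)   # hidden code, used as the feature
        recon = h @ W2 + b2            # linear decoder
        err = recon - clean            # compare against the CLEAN target
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared error.
        g_recon = 2.0 * err / err.size
        g_W2 = h.T @ g_recon
        g_b2 = g_recon.sum(axis=0)
        g_h = (g_recon @ W2.T) * h * (1.0 - h)
        g_W1 = noisy.T @ g_h
        g_b1 = g_h.sum(axis=0)
        W1 -= lr * g_W1
        b1 -= lr * g_b1
        W2 -= lr * g_W2
        b2 -= lr * g_b2
    return (W1, b1, W2, b2), losses


# Toy usage with random stand-in "features" (not real speech data).
rng = np.random.default_rng(42)
clean = rng.normal(size=(256, 12))
noise = rng.normal(size=(256, 12))
noisy = mix_at_snr(clean, noise, snr_db=0.0)
params, losses = train_dae(clean, noisy)
```

In a stacked setting, one plausible scheme (again an assumption, not confirmed by the abstract) is to train the next DAE on the hidden codes `h` produced by the previous layer and to use the deepest layer's codes as the acoustic features.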