TPTGAN: Two-Path Transformer-Based Generative Adversarial Network Using Joint Magnitude Masking and Complex Spectral Mapping for Speech Enhancement
Abstract: In recent studies, the Conformer has been extensively employed in speech enhancement. Nevertheless, it still faces the challenge of excessive suppression, especially in human-to-machine communication, caused by the unintended loss of target speech during noise filtering. While such methods may yield higher Perceptual Evaluation of Speech Quality (PESQ) scores, they are often less effective at improving the signal-to-noise ratio of speech, which has proven vital for automatic speech recognition. In this paper, we propose a two-path transformer-based metric generative adversarial network (TPTGAN) for speech enhancement in the time-frequency domain. The generator consists of an encoder, a two-stage transformer module, a magnitude mask decoder and a complex spectrum decoder. The encoder and two-path transformers characterize the magnitude and complex spectra of the inputs and model both the sub-band and full-band information of the time-frequency spectrogram. The estimation of the magnitude and complex spectra is decoupled in the decoders, and the enhanced speech is then reconstructed in conjunction with the phase information. Through carefully designed training strategies and structural adjustments, we demonstrate the remarkable efficacy of the transformer model in speech enhancement tasks. Experimental results on the Voice Bank+DEMAND dataset show that TPTGAN outperforms existing state-of-the-art methods, with an SSNR of 11.63 and a PESQ of 3.35, alleviating the problem of excessive suppression while significantly reducing model complexity (1.03 M parameters).
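The abstract outlines a generator built from an encoder, a two-path transformer module that alternates attention over frequency (sub-band) and time (full-band), and two decoupled decoders for the magnitude mask and the complex spectrum. The following is a minimal PyTorch sketch of that structure; all module names, channel sizes, and the way the two decoder outputs are combined are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TwoPathTransformerBlock(nn.Module):
    """Self-attention along frequency bins (sub-band), then along time frames (full-band)."""
    def __init__(self, dim, nhead=4):
        super().__init__()
        self.freq_tf = nn.TransformerEncoderLayer(dim, nhead, dim * 2, batch_first=True)
        self.time_tf = nn.TransformerEncoderLayer(dim, nhead, dim * 2, batch_first=True)

    def forward(self, x):                                   # x: (B, C, T, F)
        b, c, t, f = x.shape
        y = x.permute(0, 2, 3, 1).reshape(b * t, f, c)      # attend over frequency
        y = self.freq_tf(y).reshape(b, t, f, c)
        z = y.permute(0, 2, 1, 3).reshape(b * f, t, c)      # attend over time
        z = self.time_tf(z).reshape(b, f, t, c)
        return z.permute(0, 3, 2, 1)                         # back to (B, C, T, F)


class GeneratorSketch(nn.Module):
    """Hypothetical generator: encoder -> two-path transformers -> dual decoders."""
    def __init__(self, dim=32, n_blocks=2):
        super().__init__()
        # input: magnitude, real and imaginary spectra stacked as 3 channels
        self.encoder = nn.Conv2d(3, dim, kernel_size=1)
        self.tptm = nn.Sequential(*[TwoPathTransformerBlock(dim) for _ in range(n_blocks)])
        self.mask_decoder = nn.Sequential(nn.Conv2d(dim, 1, 1), nn.Sigmoid())  # magnitude mask
        self.complex_decoder = nn.Conv2d(dim, 2, 1)                            # real/imag refinement

    def forward(self, mag, real, imag):                      # each: (B, T, F)
        x = torch.stack([mag, real, imag], dim=1)
        h = self.tptm(self.encoder(x))
        mask = self.mask_decoder(h).squeeze(1)
        cplx = self.complex_decoder(h)
        # masked magnitude is recombined with the noisy phase, then refined
        # by the complex-spectrum branch (one plausible decoupling scheme)
        phase = torch.atan2(imag, real)
        est_real = mask * mag * torch.cos(phase) + cplx[:, 0]
        est_imag = mask * mag * torch.sin(phase) + cplx[:, 1]
        return est_real, est_imag


if __name__ == "__main__":
    mag = torch.rand(1, 100, 201)
    real, imag = torch.randn(1, 100, 201), torch.randn(1, 100, 201)
    r, i = GeneratorSketch()(mag, real, imag)
    print(r.shape, i.shape)   # torch.Size([1, 100, 201]) torch.Size([1, 100, 201])
```

The sketch illustrates why the two-path design keeps the model small: each transformer layer attends over only one axis of the spectrogram at a time, so attention cost grows with T + F rather than T x F.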