Leveraging Discriminative Latent Representations for Conditioning GAN-Based Speech Enhancement

Shrishti Saha Shetu, Emanuël A. P. Habets, Andreas Brendel

Published: 01 Jan 2026, Last Modified: 06 May 2026, IEEE Transactions on Audio, Speech and Language Processing, CC BY-SA 4.0

Abstract: Generative speech enhancement methods based on generative adversarial networks (GANs) have demonstrated promising performance across various speech enhancement tasks. However, their performance in very low signal-to-noise ratio (SNR) scenarios remains under-explored and limited, as these conditions pose significant challenges to both discriminative and generative state-of-the-art methods. To address this, we propose DisCoGAN, a GAN-based speech enhancement method that leverages latent features extracted from discriminative speech enhancement models as generic conditioning information. By incorporating the proposed discriminative conditioning method, DisCoGAN improves speech quality and intelligibility, particularly in low-SNR scenarios, while maintaining competitive or superior performance in high-SNR conditions and real-world recordings. We also conduct a comprehensive evaluation of conventional GAN-based architectures, including end-to-end GANs, GAN-first, and post-filtering GANs, as well as discriminative models under low-SNR conditions, and show that DisCoGAN consistently outperforms existing methods. Finally, we present ablation studies that highlight the performance gains from discriminative conditioning and demonstrate how DisCoGAN leverages both local and global temporal context, providing insight into the key factors underlying these gains.

External IDs: doi:10.1109/taslpro.2026.3677639