Abstract: Generative Adversarial Networks (GANs) have been effectively applied to vocoder tasks, enabling the reconstruction of audio waveforms from acoustic representations at speeds surpassing real-time. Nevertheless, synthesizing high-fidelity audio with existing GAN-based vocoders remains challenging. While widely used, discriminators built on Convolutional Neural Networks (CNNs) train inefficiently and struggle to capture complex patterns in long, periodic sequences. To overcome these issues, this study introduces STD, a novel discriminator employing a self-attention mechanism that autonomously highlights critical information in periodic structures. Unlike CNNs, this mechanism processes all features in a sequence concurrently, effectively capturing long-range dependencies. Our experiments on the LibriTTS and VCTK datasets show that the proposed method markedly enhances training efficiency and the quality of generated speech. Furthermore, significant improvements in objective evaluation metrics underscore the effectiveness and potential of our approach. We release our code and model at: https://github.com/TN2024/BigVGAN_STD.
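The abstract's contrast between CNNs and self-attention hinges on attention letting every position in a sequence attend to every other position in a single step. The following is a minimal dependency-free sketch of scaled dot-product self-attention over a 1-D feature sequence; it is an illustration of the general mechanism, not the paper's actual STD discriminator architecture, and all function names and toy inputs here are hypothetical.

```python
# Minimal sketch of scaled dot-product self-attention (not the paper's STD
# implementation). Every position attends to every other position at once,
# which is the property that lets an attention-based discriminator capture
# long-range dependencies a fixed-receptive-field CNN may miss.
import math

def self_attention(x, wq, wk, wv):
    """x: list of feature vectors (seq_len x d); wq/wk/wv: d x d weights."""
    def matvec(w, v):
        # Plain matrix-vector product.
        return [sum(w[i][j] * v[j] for j in range(len(v))) for i in range(len(w))]

    q = [matvec(wq, v) for v in x]      # queries
    k = [matvec(wk, v) for v in x]      # keys
    vals = [matvec(wv, v) for v in x]   # values
    d = len(x[0])
    out = []
    for qi in q:
        # Similarity of this query against ALL keys (global receptive field).
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        attn = [e / z for e in exps]
        # Output is a convex combination of all value vectors.
        out.append([sum(a * vj[t] for a, vj in zip(attn, vals))
                    for t in range(d)])
    return out

# Toy usage: 4-step sequence of 2-d features, identity projections.
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
eye = [[1.0, 0.0], [0.0, 1.0]]
y = self_attention(seq, eye, eye, eye)
```

Because each output row is a softmax-weighted mixture of all value vectors, distant positions influence one another directly, without stacking layers to grow a receptive field as a CNN must.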
External IDs: dblp:conf/iconip/GuoCPWFWJ24