Abstract: Previous works (Donahue et al., 2018a; Engel et al., 2019) have found that generating coherent raw audio waveforms with GANs is challenging. In this paper, we show that it is possible to train GANs reliably to generate high-quality coherent waveforms by introducing a set of architectural changes and simple training techniques. Subjective evaluation (Mean Opinion Score) suggests that our model is state-of-the-art for mel-spectrogram inversion. We show qualitative results on speech synthesis, music domain translation, and unconditional music synthesis to establish the generality of the proposed techniques. We also evaluate different components of the model, proposing a set of guidelines for designing general-purpose discriminators and generators for conditional sequence synthesis tasks. Our model is non-autoregressive and fully convolutional, has significantly fewer parameters than competing models, and generalizes to unseen speakers for mel-spectrogram inversion. Our PyTorch implementation runs more than 100x faster than realtime on a GTX 1080 Ti GPU and more than 2x faster than realtime on CPU, without any hardware-specific optimization tricks.
Code Link: https://github.com/descriptinc/melgan-neurips
CMT Num: 8485