Abstract: We propose a neural network-based technique for enhancing the quality of audio signals such as speech or music by transforming inputs encoded at low sampling rates into higher-quality signals with an increased resolution in the time domain. This amounts to generating the missing samples within the low-resolution signal in a process akin to image super-resolution. On standard speech and music datasets, this approach outperforms baselines at 2x, 4x, and 6x upscaling ratios. The method has practical applications in telephony, compression, and text-to-speech generation; it can also be used to improve the scalability of recently-proposed generative models of audio.
Conflicts: mcgill.ca, cornell.edu, berkeley.edu