Abstract: Non-autoregressive text-to-speech models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models while achieving comparable quality. However, the O(N²) memory and time complexity of self-attention, where N is the length of the mel-spectrogram, hinders FastSpeech from generating long sequences. In this work, we propose LinearSpeech, an efficient parallel text-to-speech model with O(N) memory and computational complexity. First, we replace the standard attention modules in the decoder of the model with linear attention modules to reduce time and memory cost. Second, we add a novel positional encoding to both the standard and linear attention modules, which enables the model to learn the order of the input sequence and to synthesize long mel-spectrograms. Furthermore, we use reversible residual layers instead of standard residual connections, which reduces memory consumption during training. In our experiments, LinearSpeech can be trained with twice the batch size of FastSpeech at a similar number of parameters. At inference, LinearSpeech achieves more than a 2.0× speedup on CPU when synthesizing mel-spectrograms longer than 3,500 frames. Our model can also synthesize 5.5× longer mel-spectrograms than FastSpeech before running out of memory on a 12 GB GPU. Our subjective listening test also shows that the speech quality of LinearSpeech is comparable to that of FastSpeech.
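
To illustrate why linear attention scales as O(N) rather than O(N²), the following is a minimal NumPy sketch of one common kernelized formulation. It is an assumption for illustration only: the abstract does not state which linear attention variant, feature map, or positional encoding LinearSpeech uses, so the `elu(x) + 1` feature map and the function names below are hypothetical. The sketch shows that reordering the matrix products avoids ever building the N × N attention matrix while producing the same output as the quadratic form.

```python
import numpy as np

def feature_map(x):
    # Positive feature map elu(x) + 1 (an assumed choice, not taken from the abstract).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def quadratic_attention(q, k, v):
    # Reference form: materializes the full (N, N) similarity matrix,
    # so time and memory grow as O(N^2).
    phi_q, phi_k = feature_map(q), feature_map(k)
    scores = phi_q @ phi_k.T                           # (N, N)
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ v                                 # (N, d_v)

def linear_attention(q, k, v):
    # Same result, computed as phi(Q) (phi(K)^T V): the intermediate is (d, d_v),
    # independent of N, so time and memory grow as O(N).
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v                                   # (d, d_v)
    z = phi_k.sum(axis=0)                              # (d,) normalizer terms
    return (phi_q @ kv) / (phi_q @ z)[:, None]         # (N, d_v)

# Hypothetical sizes: N frames, key dimension d, value dimension d_v.
rng = np.random.default_rng(0)
N, d, d_v = 50, 8, 16
q = rng.standard_normal((N, d))
k = rng.standard_normal((N, d))
v = rng.standard_normal((N, d_v))

out_quadratic = quadratic_attention(q, k, v)
out_linear = linear_attention(q, k, v)
```

Both functions return an array of shape `(N, d_v)`. Only the order of the multiplications differs, which is where the reduction in cost for long mel-spectrograms would come from under this assumed formulation.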