Abstract: Recently, the field of text-to-speech synthesis has been dominated by end-to-end models, whose generated speech is increasingly comparable in quality to human speech. In this work, we propose a Lightweight and Efficient Text-to-speech model, a fast, fully differentiable end-to-end framework based on EfficientTTS 2. We replace the standard stacked Transformer with Fast Linear Attention with a Single Head, which lowers computational complexity and reduces the parameter count. In addition, we improve the ConvWaveNet network architecture to further decrease model parameters, and we accelerate inference with a multi-stream inverse short-time Fourier transform generator. Together, these improvements significantly reduce model size and increase inference speed, achieving the objectives of faster inference and lightweight modeling. Experimental results show that the proposed model achieves speech quality comparable to that of the baseline models while offering faster inference and a smaller model size.