FastStitch: Speech editing by hitch-hiking a pre-trained FastSpeech2 model

Published: 03 Nov 2023, Last Modified: 23 Dec 2023 · NLDL 2024
Keywords: tts, speech editing, speech synthesis
TL;DR: Speech editing using a pretrained FastSpeech2 model and a newly proposed convolution- or attention-based blending network that outperforms state-of-the-art methods.
Abstract: We present an innovative approach to speech editing that avoids the time-consuming process of training acoustic models from scratch. Our method fine-tunes the upper layers of a pre-trained FastSpeech2 model and, at inference time, fuses their output with information from a reference mel-spectrogram via a convolution-based or attention-based blending network. Comparative evaluations against baseline and state-of-the-art methods on single-speaker (LJSpeech) and multi-speaker (VCTK) datasets, using both subjective and objective measures, demonstrate the superior quality of our approach, yielding significantly more natural-sounding speech edits.
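The abstract describes blending generated decoder features with features from a reference mel-spectrogram. The sketch below illustrates one plausible form of such a convolution-based blend, assuming a per-frame sigmoid gate computed by a small 1-D convolution; the function names, gate formulation, and shapes are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def conv1d_same(x, w):
    """Depthwise 1-D convolution along time with 'same' padding.
    x: (T, C) feature sequence, w: (K, C) per-channel kernel."""
    K, _ = w.shape
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = np.sum(xp[t:t + K] * w, axis=0)
    return out

def blend(decoder_feats, reference_feats, w_gate):
    """Hypothetical convolution-based blending: a sigmoid gate,
    computed from the difference of the two streams, decides per
    time-frame and channel how much of each stream to keep."""
    g = 1.0 / (1.0 + np.exp(-conv1d_same(decoder_feats - reference_feats, w_gate)))
    # Convex combination: output stays between the two inputs elementwise.
    return g * decoder_feats + (1.0 - g) * reference_feats
```

Because the gate lies in (0, 1), each output frame is a convex combination of the synthesized and reference features, which is one simple way to splice an edited region into surrounding original audio.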
Project: https://faststitch.github.io
Submission Number: 27