FastStitch: Speech editing by hitch-hiking a pre-trained FastSpeech2 model

Published: 03 Nov 2023, Last Modified: 23 Dec 2023 · NLDL 2024
Keywords: tts, speech editing, speech synthesis
TL;DR: Speech editing using a pretrained FastSpeech2 model and a newly proposed convolution- or attention-based blending network that outperforms state-of-the-art methods.
Abstract: We present an innovative approach to speech editing that avoids the time-consuming process of training acoustic models from scratch. Our method fine-tunes the upper layers of a pre-trained FastSpeech2 model and, at inference time, fuses their output with information from a reference mel-spectrogram via a convolution-based or attention-based blending network. Comparative evaluations against baseline and state-of-the-art methods on single-speaker (LJSpeech) and multi-speaker (VCTK) datasets, using both subjective and objective measures, demonstrate the superior quality of our approach, yielding significantly more natural-sounding speech edits.
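The abstract describes blending generated decoder features with features from a reference mel-spectrogram. The sketch below illustrates one plausible form of such a convolution-based blend, assuming a per-frame sigmoid gate computed by a small 1-D convolution; the function names, gate formulation, and shapes are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def conv1d_same(x, w):
    """Depthwise 1-D convolution along time with 'same' padding.
    x: (T, C) feature sequence, w: (K, C) per-channel kernel."""
    K, _ = w.shape
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = np.sum(xp[t:t + K] * w, axis=0)
    return out

def blend(decoder_feats, reference_feats, w_gate):
    """Hypothetical convolution-based blending: a sigmoid gate,
    computed from the difference of the two streams, decides per
    time-frame and channel how much of each stream to keep."""
    g = 1.0 / (1.0 + np.exp(-conv1d_same(decoder_feats - reference_feats, w_gate)))
    # Convex combination: output stays between the two inputs elementwise.
    return g * decoder_feats + (1.0 - g) * reference_feats
```

Because the gate lies in (0, 1), each output frame is a convex combination of the synthesized and reference features, which is one simple way to splice an edited region into surrounding original audio.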
Project: https://faststitch.github.io
Submission Number: 27