Cross-utterance Conditioned Coherent Speech Editing via Biased Training and Entire Inference

Abstract: Text-based speech editing systems let users select, cut, copy, and paste speech by editing its transcript. Existing state-of-the-art neural editing systems all perform partial inference: they generate only the words that need to be replaced or inserted. As a result, the prosody of the edited part is often inconsistent with the preceding and following speech, and changes in intonation cannot be handled. To address these problems, we propose a cross-utterance conditioned coherent speech editing system, which is the first to run inference over the entire utterance rather than only the edited part. Built on a cross-utterance conditioned variational autoencoder, the proposed system generates speech from speaker information, context and acoustic features, and the mel-spectrogram of the unedited fragments of the original audio. In addition, we apply biased training so that the model concentrates on the part that needs to be reconstructed. Subjective and objective evaluations show that our approach outperforms the partial inference method in naturalness and prosody consistency across a range of editing operations.
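
As a rough illustration of the biased training idea, the sketch below up-weights the reconstruction error of frames in the region to be reconstructed. It is only one plausible reading of the abstract, not the authors' loss: the function name `biased_reconstruction_loss`, the use of mean absolute error, and the `edit_weight` value are all assumptions made for this example.

```python
def biased_reconstruction_loss(pred_mel, target_mel, edit_mask, edit_weight=2.0):
    """Weighted mean absolute error over mel-spectrogram frames.

    pred_mel, target_mel: lists of frames, each frame a list of mel-bin values.
    edit_mask: list of booleans, True where the frame must be reconstructed.
    edit_weight: hypothetical bias factor; values above 1 emphasise edited frames.
    """
    if not (len(pred_mel) == len(target_mel) == len(edit_mask)):
        raise ValueError("pred_mel, target_mel and edit_mask must have equal length")
    if not pred_mel:
        raise ValueError("at least one frame is required")

    weighted_error = 0.0
    total_weight = 0.0
    for pred_frame, target_frame, is_edited in zip(pred_mel, target_mel, edit_mask):
        # Per-frame mean absolute error across mel bins.
        frame_error = sum(abs(p - t) for p, t in zip(pred_frame, target_frame)) / len(pred_frame)
        # Assumed bias: edited frames count edit_weight times as much as the rest.
        weight = edit_weight if is_edited else 1.0
        weighted_error += weight * frame_error
        total_weight += weight

    # Normalise by the total weight so the result stays on the scale of a plain mean.
    return weighted_error / total_weight
```

With `edit_weight=1.0` this reduces to an ordinary mean absolute error over all frames; larger values shift the loss toward the frames marked in `edit_mask`, while the unedited frames still contribute.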
