Adaptive Speech Duration Modification using a Deep-Generative Framework

Published: 28 Jan 2022, Last Modified: 13 Feb 2023, ICLR 2022 Submitted, Readers: Everyone
Keywords: Prosody, Encoder-Decoder, Attention, Adaptive Duration Modification, Dynamic Time Warping
Abstract: We propose the first method to adaptively modify the duration of a given speech signal. Our approach uses a Bayesian framework to define a latent attention map that links frames of the input and target utterances. We train a masked convolutional encoder-decoder network to generate this attention map via a stochastic version of the mean absolute error loss function. Our model also predicts the length of the target speech signal using the encoder embeddings, which determines the number of time steps for the decoding operation. During testing, we generate the attention map as a proxy for the similarity matrix between the given input speech and an unknown target speech signal. Using this similarity matrix, we compute a warping path of alignment between the two signals. Our experiments demonstrate that this adaptive framework produces similar results to dynamic time warping, which relies on a known target signal, on both voice conversion and emotion conversion tasks. We also show that the modified speech utterances achieve high user quality ratings, thus highlighting the practical utility of our method.
One-sentence Summary: We propose a generative model to locally manipulate speaking rate in a given speech utterance that can be adapted for tasks such as voice/accent conversion and emotion conversion.
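The abstract's final step, computing a warping path from a similarity matrix, can be illustrated with a standard dynamic time warping recursion. The sketch below is not the authors' implementation: the function name `warping_path`, the conversion of similarity to cost as `max - similarity`, and the three-way step pattern are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def warping_path(similarity):
    """Return a monotonic alignment path through a similarity matrix.

    similarity[i, j] scores how well input frame i matches target frame j.
    Hypothetical helper: uses textbook DTW, not the paper's exact procedure.
    """
    sim = np.asarray(similarity, dtype=float)
    n, m = sim.shape
    # Assumption: turn similarity into a non-negative cost to minimize.
    cost = sim.max() - sim

    # acc[i, j] = lowest cumulative cost of any path from (0, 0) to (i, j).
    acc = np.full((n, m), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = []
            if i > 0 and j > 0:
                prev.append(acc[i - 1, j - 1])  # advance both signals
            if i > 0:
                prev.append(acc[i - 1, j])      # advance input only
            if j > 0:
                prev.append(acc[i, j - 1])      # advance target only
            acc[i, j] = cost[i, j] + min(prev)

    # Backtrack from the last frame pair to the first.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        steps = []
        if i > 0 and j > 0:
            steps.append((acc[i - 1, j - 1], (i - 1, j - 1)))
        if i > 0:
            steps.append((acc[i - 1, j], (i - 1, j)))
        if j > 0:
            steps.append((acc[i, j - 1], (i, j - 1)))
        _, (i, j) = min(steps, key=lambda s: s[0])
        path.append((i, j))
    path.reverse()
    return path


# Toy example: a 2-frame input stretched to a 4-frame target.
toy_similarity = [[1, 1, 0, 0],
                  [0, 0, 1, 1]]
toy_path = warping_path(toy_similarity)
```

In the toy example each input frame is aligned to two consecutive target frames, which is the kind of local duration change the abstract describes. In the proposed method the similarity matrix would come from the predicted attention map rather than from a known target signal.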