Adaptive Duration Modification of Speech using Masked Convolutional Networks and Open-Loop Time Warping

Published: 14 Jun 2023, Last Modified: 26 Jun 2023, SSW12, Readers: Everyone
Keywords: Adaptive duration modification, seq-2-seq convolutional, DTW constrained attention
TL;DR: We propose a convolutional-DTW framework that adaptively modifies the speaking rate of recorded speech for voice and emotion conversion.
Abstract: We propose a new method to adaptively modify the rhythm of a given speech signal. We train a masked convolutional encoder-decoder network to generate an attention map via a stochastic version of the mean absolute error loss function. Our model also predicts the length of the target speech signal from the encoder embeddings, which determines the number of time steps for the decoding operation. During testing, we use the learned attention map as a proxy for the frame-wise similarity matrix between the given input speech and an unknown target speech signal. We then compute a warping path for rhythm modification in an open-loop fashion. Our experiments demonstrate that this adaptive framework achieves performance comparable to the fully supervised dynamic time warping algorithm on both voice conversion and emotion conversion tasks. We also show that the modified speech utterances achieve high user quality ratings, highlighting the practical utility of our method.
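
To illustrate the step of turning an attention map into a warping path, here is a minimal sketch using textbook dynamic time warping. It is not the authors' implementation. It assumes the attention map is a matrix of values in [0, 1] with input frames as rows and target frames as columns, and it assumes the cost of aligning two frames can be taken as one minus their attention value. The function name `warping_path` and the toy matrix are hypothetical.

```python
def warping_path(attention):
    """Return a monotonic alignment [(input_frame, target_frame), ...].

    Assumption: cost = 1 - attention, minimized with the standard DTW
    step pattern (diagonal, vertical, horizontal moves).
    """
    n, m = len(attention), len(attention[0])
    cost = [[1.0 - attention[i][j] for j in range(m)] for i in range(n)]
    acc = [[float("inf")] * m for _ in range(n)]   # accumulated cost
    back = [[None] * m for _ in range(n)]          # backpointers
    acc[0][0] = cost[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                candidates.append((acc[i - 1][j - 1], (i - 1, j - 1)))
            if i > 0:
                candidates.append((acc[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((acc[i][j - 1], (i, j - 1)))
            best, prev = min(candidates, key=lambda c: c[0])
            acc[i][j] = best + cost[i][j]
            back[i][j] = prev
    # Trace back from the last frame pair to the first.
    path = [(n - 1, m - 1)]
    while path[-1] != (0, 0):
        i, j = path[-1]
        path.append(back[i][j])
    path.reverse()
    return path


# Toy attention map: 3 input frames (rows), 4 predicted target frames (columns).
toy_attention = [
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.7, 0.1],
    [0.0, 0.1, 0.2, 0.9],
]
print(warping_path(toy_attention))
```

In this toy case the second input frame is aligned to two target frames, which corresponds to locally slowing the speech down. The number of columns stands in for the target length that the model predicts.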
Supplementary Material: zip