Adaptive Duration Modification of Speech using Masked Convolutional Networks and Open-Loop Time Warping

Ravi Shankar; Archana Venkataraman

Published: 14 Jun 2023, Last Modified: 26 Jun 2023 | SSW12 | Readers: Everyone

Keywords: Adaptive duration modification, seq-2-seq convolutional, DTW constrained attention

TL;DR: We propose a Convolutional-DTW framework to adaptively modify the speaking rate of recorded speech for voice/emotion conversion.

Abstract: We propose a new method to adaptively modify the rhythm of a given speech signal. We train a masked convolutional encoder-decoder network to generate an attention map via a stochastic version of the mean absolute error loss function. Our model also predicts the length of the target speech signal from the encoder embeddings, which determines the number of time steps for the decoding operation. During testing, we use the learned attention map as a proxy for the frame-wise similarity matrix between the given input speech and an unknown target speech signal. From this matrix, we compute a warping path for rhythm modification in an open-loop fashion. Our experiments demonstrate that this adaptive framework achieves performance similar to that of the fully supervised dynamic time warping algorithm on both voice conversion and emotion conversion tasks. We also show that the modified speech utterances achieve high user quality ratings, thus highlighting the practical utility of our method.
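
The open-loop warping step described in the abstract can be illustrated with textbook dynamic time warping applied to a similarity matrix. The sketch below is not the authors' implementation: it assumes the attention map is an array of shape (input frames, output frames) with values in [0, 1], converts it to a cost as 1 minus similarity, and uses the standard three-way DTW recursion. The function names `warping_path` and `warp_frames` are hypothetical.

```python
import numpy as np


def warping_path(similarity):
    """Return a monotonic DTW path through a (T_in, T_out) similarity matrix.

    Assumption: entries lie in [0, 1], so 1 - similarity acts as a local cost.
    """
    cost = 1.0 - np.asarray(similarity, dtype=float)
    n, m = cost.shape

    # acc[i, j] holds the cheapest cumulative cost of aligning the first
    # i input frames with the first j output frames.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev

    # Backtrack from the last cell to the first one.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    return path[::-1]


def warp_frames(frames, path):
    """Resample input frames along the path, one frame per output time step."""
    frames = np.asarray(frames)
    n_out = path[-1][1] + 1
    index = np.zeros(n_out, dtype=int)
    for i, j in path:
        index[j] = i  # keep the last input frame aligned to output step j
    return frames[index]


# Toy example: two input frames stretched to four output steps.
toy_similarity = np.array([[1.0, 1.0, 0.0, 0.0],
                           [0.0, 0.0, 1.0, 1.0]])
toy_path = warping_path(toy_similarity)
toy_output = warp_frames(np.array([10, 20]), toy_path)
```

In this toy case each input frame is repeated twice, which corresponds to slowing the speaking rate by a factor of two. In the setting the abstract describes, the number of output columns would come from the model's predicted target length rather than from a known target signal.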

Supplementary Material: zip
