Adaptive Duration Modification of Speech using Masked Convolutional Networks and Open-Loop Time Warping
Keywords: Adaptive duration modification, seq-2-seq convolutional, DTW constrained attention
TL;DR: We propose a Convolutional-DTW framework that adaptively modifies the speaking rate of recorded speech for voice and emotion conversion.
Abstract: We propose a new method to adaptively modify the rhythm of a given speech signal. We train a masked convolutional encoder-decoder network to generate an attention map via a stochastic version of the mean absolute error loss function. Our model also predicts the length of the target speech signal from the encoder embeddings, which determines the number of time steps for the decoding operation. During testing, we use the learned attention map as a proxy for the frame-wise similarity matrix between the given input speech and an unknown target speech signal. In an open-loop fashion, we compute a warping path for rhythm modification. Our experiments demonstrate that this adaptive framework achieves performance similar to that of the fully supervised dynamic time warping algorithm on both voice conversion and emotion conversion tasks. We also show that the modified speech utterances achieve high user quality ratings, thus highlighting the practical utility of our method.
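As an illustration of the warping-path step only, here is a minimal sketch of how a monotonic path could be extracted from a frame-wise similarity matrix such as an attention map. It is a textbook dynamic time warping recursion, not the authors' implementation. The function name warping_path, the cost definition 1 - similarity (which assumes similarity values in [0, 1]), and the three allowed steps (diagonal, vertical, horizontal) are all assumptions made for this example.

```python
def warping_path(similarity):
    """Return a monotonic alignment path through a similarity matrix.

    similarity[i][j] is the assumed similarity between input frame i and
    target frame j (for example, an attention weight in [0, 1]).
    The result is a list of (i, j) pairs from (0, 0) to (n - 1, m - 1).
    """
    n, m = len(similarity), len(similarity[0])
    inf = float("inf")

    # acc[i][j] holds the minimal accumulated cost of any path to (i, j).
    acc = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cost = 1.0 - similarity[i][j]  # assumed cost: high similarity is cheap
            if i == 0 and j == 0:
                acc[i][j] = cost
                continue
            best_prev = min(
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,  # diagonal step
                acc[i - 1][j] if i > 0 else inf,                # vertical step
                acc[i][j - 1] if j > 0 else inf,                # horizontal step
            )
            acc[i][j] = cost + best_prev

    # Backtrack from the last cell to the first along cheapest predecessors.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return path
```

In this sketch, a run of path entries that share the same input index i corresponds to stretching that input frame over several target frames, and a run that shares the same target index j corresponds to compressing several input frames into one.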
Supplementary Material: zip