Learning to control the pitch of speech signals in the latent representation of a variational autoencoder

Anonymous

10 Mar 2022 (modified: 05 May 2023)
Submitted to ICLR 2022 DGM4HSD workshop
Readers: Everyone
Keywords: Deep generative models, variational autoencoder (VAE), speech processing, representation learning, controlling the pitch of speech
TL;DR: We propose a weakly supervised method for learning to control a continuous latent factor in the latent space of a VAE, which we apply to the control of the pitch of speech signals.
Abstract: Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. Speech signals are produced from a few physically meaningful continuous latent factors governed by the anatomical mechanisms of phonation. Among these factors, the fundamental frequency is of primary importance, as it characterizes the pitch of the voice, an important feature of prosody. In this work, starting from a variational autoencoder (VAE) trained in an unsupervised fashion on hours of natural speech signals, and using a few seconds of labeled speech generated with an artificial synthesizer, we propose a weakly supervised method to (i) identify the latent subspace of the VAE that encodes only the pitch and (ii) learn how to move within this subspace so as to precisely control the fundamental frequency. The proposed approach is applied to the transformation of speech signals and compared experimentally with traditional signal processing methods and a VAE baseline.
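As a rough illustration of steps (i) and (ii), the sketch below shows one plausible way to find a pitch subspace and move within it. It is not the authors' implementation: the abstract does not specify the techniques, so PCA and polynomial regression are used here as stand-ins, and the latent vectors are synthetic toy data rather than the output of a real VAE encoder. All function and variable names (`fit_pitch_subspace`, `set_pitch`, and so on) are hypothetical.

```python
import numpy as np


def fit_pitch_subspace(latents, pitches, n_components=1, degree=1):
    """Assumed approach: PCA on latent codes in which only the pitch varies.

    latents: (n, d) latent codes of labeled speech; pitches: (n,) labels in Hz.
    """
    mean = latents.mean(axis=0)
    centered = latents - mean
    # PCA via SVD: the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components].T  # (d, k), orthonormal columns
    coords = centered @ basis  # (n, k) coordinates in the subspace
    # Polynomial regression from the pitch label to subspace coordinates.
    design = np.vander(pitches, degree + 1)
    weights, *_ = np.linalg.lstsq(design, coords, rcond=None)
    return mean, basis, weights


def set_pitch(z, target_pitch, mean, basis, weights):
    """Move z inside the pitch subspace only; the orthogonal part is untouched."""
    design = np.vander(np.atleast_1d(float(target_pitch)), weights.shape[0])
    target_coords = (design @ weights)[0]
    current_coords = (z - mean) @ basis
    return z + basis @ (target_coords - current_coords)


# Toy data (an assumption): the pitch moves the latent code along one direction.
rng = np.random.default_rng(0)
d = 8
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
pitches = np.linspace(100.0, 300.0, 50)
latents = np.outer((pitches - 200.0) / 100.0, direction)
latents += 0.01 * rng.normal(size=(50, d))

mean, basis, weights = fit_pitch_subspace(latents, pitches)
z = rng.normal(size=d)  # stands in for the latent code of a natural speech frame
z_new = set_pitch(z, 220.0, mean, basis, weights)
```

In a real pipeline, `latents` would come from encoding the labeled synthesizer output with the trained VAE, and `z_new` would be passed to the decoder to resynthesize the frame at the target fundamental frequency.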