T-DVAE: A Transformer-Based Dynamical Variational Autoencoder for Speech

Jan-Ole Perschewski, Sebastian Stober

Published: 01 Jan 2024, Last Modified: 09 Nov 2025 · CC BY-SA 4.0
Abstract: In contrast to variational autoencoders, dynamical variational autoencoders (DVAEs) learn a sequence of latent states for a time series. They were initially implemented with recurrent neural networks (RNNs), which are known for challenging training dynamics and difficulties with long-term dependencies. This motivated recent Transformer-based implementations that stay close to the original RNN-based designs; these still use RNNs as part of the architecture, even though the Transformer can solve the task as the sole building block. We therefore improve the LigHT-DVAE architecture by removing its dependence on RNNs and cross-attention. Furthermore, we show that a trained LigHT-DVAE ignores its output-to-hidden connections, which allows us to simplify the overall architecture by removing them. We demonstrate the capability of the resulting T-DVAE on LibriSpeech and VoiceBank, with improvements in training time, memory consumption, and generative performance.
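To make the architectural idea concrete, below is a minimal, illustrative PyTorch sketch (not the authors' released code) of a DVAE built from causal self-attention alone: the inference network approximates q(z_t | x_{1:t}), the generative network models p(x_t | z_{1:t}), and there is no RNN, no cross-attention, and no output-to-hidden feedback. All module names, layer sizes, and depths here are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class TDVAESketch(nn.Module):
    """Illustrative DVAE using only causal Transformer self-attention."""

    def __init__(self, x_dim=64, z_dim=16, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed_x = nn.Linear(x_dim, d_model)
        self.embed_z = nn.Linear(z_dim, d_model)
        # Inference and generative networks are plain Transformer encoder
        # stacks; a causal mask supplies the temporal conditioning that an
        # RNN would otherwise provide.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers)
        self.to_mu = nn.Linear(d_model, z_dim)
        self.to_logvar = nn.Linear(d_model, z_dim)
        self.to_x = nn.Linear(d_model, x_dim)

    def forward(self, x):
        # x: (batch, time, x_dim)
        T = x.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(x.device)
        # Inference: q(z_t | x_{1:t}) via causally masked self-attention.
        h = self.encoder(self.embed_x(x), mask=causal)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterisation trick.
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        # Generation: p(x_t | z_{1:t}); the decoder sees only the latent
        # sequence, i.e. no output-to-hidden connection back from x.
        d = self.decoder(self.embed_z(z), mask=causal)
        return self.to_x(d), mu, logvar

# Example usage with random features standing in for speech frames:
x = torch.randn(8, 100, 64)               # (batch, time, feature dim)
x_hat, mu, logvar = TDVAESketch()(x)
```

In this sketch the causal mask alone carries the dependence on past time steps, and feeding the decoder only the latent sequence mirrors the paper's observation that a trained LigHT-DVAE ignores output-to-hidden connections, so dropping them loses nothing while simplifying the model.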