Hierarchical Multimodal Variational Autoencoders

Published: 28 Jan 2022, Last Modified: 13 Feb 2023 · ICLR 2022 Submitted · Readers: Everyone
Keywords: hierarchical vae, variational inference, multimodal learning
Abstract: Humans find structure in natural phenomena by absorbing stimuli from multiple input sources such as vision, text, and speech. We study deep generative models that generate multimodal data from latent representations. Existing approaches generate samples from a single shared latent variable, sometimes adding marginally independent latent variables to capture modality-specific variations. However, there are cases where modality-specific variations depend on the kind of structure shared across modalities. To capture such heterogeneity, we propose a hierarchical multimodal VAE (HMVAE) that represents modality-specific variations using latent variables that depend on a shared top-level variable. Our experiments on the CUB and Oxford Flower datasets show that the HMVAE can represent multimodal heterogeneity and outperforms existing methods in sample generation quality and in quantitative measures such as the held-out log-likelihood.
One-sentence Summary: Multimodal data can have hierarchical structure and hence deserves a hierarchical latent space
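The generative hierarchy sketched in the abstract can be illustrated with ancestral sampling: draw the shared top-level latent z first, then a modality-specific latent w_m conditioned on z, then the observation x_m. The sketch below is an illustrative assumption, not the authors' implementation; the dimensions, modality names, and random linear maps are all placeholders.

```python
# Illustrative sketch (assumed, not the paper's code) of the HMVAE
# generative process: z ~ p(z), w_m ~ p(w_m | z), x_m ~ p(x_m | w_m),
# so modality-specific variation depends on the shared structure.
import numpy as np

rng = np.random.default_rng(0)

D_Z, D_W, D_X = 8, 4, 16           # shared / modality-latent / data dims (arbitrary)
MODALITIES = ["image", "text"]     # placeholder modality names

# Hypothetical "learned" parameters; random linear maps stand in for networks.
params = {
    m: {
        "W_mu": rng.normal(size=(D_Z, D_W)),   # maps z to the mean of w_m
        "W_dec": rng.normal(size=(D_W, D_X)),  # maps w_m to the mean of x_m
    }
    for m in MODALITIES
}

def sample_hierarchy():
    """Ancestral sampling through the two-level latent hierarchy."""
    z = rng.normal(size=D_Z)                   # shared top-level latent
    out = {}
    for m in MODALITIES:
        mu_w = z @ params[m]["W_mu"]           # modality latent conditioned on z
        w_m = mu_w + rng.normal(size=D_W)      # w_m | z, unit-variance Gaussian
        x_m = w_m @ params[m]["W_dec"]         # decoded mean of observation x_m
        out[m] = (w_m, x_m)
    return z, out

z, samples = sample_hierarchy()
```

Because every w_m is conditioned on the same z, the modalities stay coherent while each keeps its own source of variation, which is the heterogeneity the abstract contrasts with marginally independent modality-specific latents.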
Supplementary Material: zip