Joint Multimodal Learning with Deep Generative Models

Masahiro Suzuki; Kotaro Nakayama; Yutaka Matsuo

30 Jun 2025 (modified: 22 Jun 2025), Submitted to ICLR 2017, Readers: Everyone

Abstract: We investigate deep generative models that can exchange multiple modalities bi-directionally, e.g., generating images from corresponding texts and vice versa. Several recent studies handle multiple modalities with deep generative models. However, these models typically assume a conditional relation between modalities, so they can generate modalities in only one direction. To achieve our objective, we should extract a joint representation that captures high-level concepts among all modalities and through which we can exchange them bi-directionally. We propose a joint multimodal variational autoencoder (JMVAE), in which all modalities are independently conditioned on a joint representation. In other words, it models a joint distribution of modalities. Furthermore, to generate missing modalities properly from the remaining ones, we develop an additional method, JMVAE-kl, which is trained by reducing the divergence between JMVAE's encoder and separately prepared networks for the respective modalities. Our experiments show that JMVAE can generate multiple modalities bi-directionally.
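The divergence term that JMVAE-kl reduces can be illustrated with a minimal sketch. It assumes that the joint encoder and the per-modality networks each output a diagonal Gaussian, given as a mean and a log-variance per latent dimension; the abstract does not state this. The function names `kl_diag_gaussians` and `jmvae_kl_penalty` are hypothetical and are not taken from the authors' code.

```python
import math


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(q || p) between two diagonal Gaussians.

    Each argument is a list with one entry per latent dimension.
    Assumption: both distributions are diagonal Gaussians.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def jmvae_kl_penalty(joint, single_encoders):
    """Sum of KL(joint || single) over the single-modality encoders.

    `joint` is a (mu, logvar) pair from the encoder that sees all modalities;
    `single_encoders` is a list of (mu, logvar) pairs, one per modality.
    Hypothetical helper: how this term is weighted against the rest of the
    training objective is not specified in the abstract.
    """
    mu_j, logvar_j = joint
    return sum(
        kl_diag_gaussians(mu_j, logvar_j, mu_s, logvar_s)
        for mu_s, logvar_s in single_encoders
    )


if __name__ == "__main__":
    joint = ([0.0, 0.0], [0.0, 0.0])        # joint encoder output (toy values)
    image_only = ([1.0, 0.0], [0.0, 0.0])   # image-only network output (toy values)
    text_only = ([0.0, 0.0], [0.0, 0.0])    # text-only network output (toy values)
    print(jmvae_kl_penalty(joint, [image_only, text_only]))
```

Reducing this penalty pulls each single-modality network toward the joint encoder, which is what would let a missing modality be generated from the remaining one.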

Conflicts: u-tokyo.ac.jp

Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/joint-multimodal-learning-with-deep/code)
