TL;DR: Multimodal learning with VAEs that accounts for the dependence between expert distributions.
Abstract: Multimodal learning with variational autoencoders (VAEs) requires estimating joint distributions to evaluate the evidence lower bound (ELBO). Current methods, based on the product or mixture of experts, aggregate single-modality distributions under a simplifying independence assumption, which is overly optimistic. This research introduces a novel methodology for aggregating single-modality distributions by exploiting the principle of *consensus of dependent experts* (CoDE), which circumvents the aforementioned assumption. Utilizing the CoDE method, we propose a novel ELBO that approximates the joint likelihood of the multimodal data by learning the contribution of each subset of modalities. The resulting CoDE-VAE model demonstrates better performance in balancing the trade-off between generative coherence and generative quality, and produces more precise log-likelihood estimates. CoDE-VAE further reduces the generative quality gap as the number of modalities increases. In certain cases, it reaches a generative quality similar to that of unimodal VAEs, a desirable property that most current methods lack. Finally, the classification accuracy achieved by CoDE-VAE is comparable to that of state-of-the-art multimodal VAE models.
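For context, below is a minimal NumPy sketch of the standard independence-based product-of-experts fusion of Gaussian posteriors that the abstract contrasts with CoDE. The function name and example values are ours and purely illustrative; CoDE's own aggregation additionally models the dependence between experts and is not reproduced here.

```python
import numpy as np

def poe_fusion(mus, logvars):
    """Product-of-experts fusion of diagonal Gaussian experts.

    Each modality-specific posterior is treated as an independent expert,
    and the joint posterior is their precision-weighted product. This is
    the independence assumption that CoDE is designed to avoid.
    """
    mus = np.asarray(mus)                       # shape: (n_experts, latent_dim)
    precisions = np.exp(-np.asarray(logvars))   # 1 / sigma^2 per expert
    joint_precision = precisions.sum(axis=0)
    joint_var = 1.0 / joint_precision
    joint_mu = (precisions * mus).sum(axis=0) * joint_var
    return joint_mu, np.log(joint_var)

# Example: two modality-specific posteriors over a 3-dimensional latent space.
mu_image, logvar_image = np.array([0.5, -1.0, 0.2]), np.array([0.0, 0.5, -0.3])
mu_text,  logvar_text  = np.array([0.3, -0.8, 0.6]), np.array([0.2, -0.1, 0.4])

joint_mu, joint_logvar = poe_fusion([mu_image, mu_text],
                                    [logvar_image, logvar_text])
print("joint mean:", joint_mu)
print("joint log-variance:", joint_logvar)
```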
Lay Summary: In today’s world, we often deal with information coming from many different sources at once, like images, text, and sound. Teaching computers to understand and combine this kind of mixed information to generate realistic and consistent data is called multimodal learning. A popular tool for this purpose is a type of artificial intelligence model known as multimodal variational autoencoder (VAE). However, current methods for combining different types of data in multimodal VAEs often make a simplifying assumption: that each type of data is independent from the others. This assumption makes the math easier, but it doesn’t reflect how things work in the real world where, for example, a picture and its caption are clearly related.
This research introduces a smarter way to combine different types of data by using a new method called CoDE (Consensus of Dependent Experts). Rather than assuming independence, CoDE models the interdependencies among the different types of data. This leads to a new and improved version of a multimodal VAE, called CoDE-VAE. The results show that CoDE-VAE does a better job at generating realistic and consistent data, especially when more types of data are involved. It also matches the accuracy of the best existing models when it comes to classifying information.
Link To Code: https://github.com/rogelioamancisidor/codevae
Primary Area: Probabilistic Methods->Variational Inference
Keywords: Multimodal Learning, Product of Experts, Generative Models, Variational Autoencoders
Submission Number: 13017