Detecting incidental correlation in multimodal learning via latent variable modeling

Published: 06 Sept 2023, Last Modified: 06 Sept 2023. Accepted by TMLR.
Abstract: Multimodal neural networks often fail to utilize all modalities. They subsequently generalize worse than their unimodal counterparts, or make predictions that only depend on a subset of modalities. We refer to this problem as \emph{modality underutilization}. Existing work has addressed this issue by ensuring that there are no systematic biases in dataset creation, or that our neural network architectures and optimization algorithms are capable of learning modality interactions. We demonstrate that even when these favorable conditions are met, modality underutilization can still occur in the small data regime. To explain this phenomenon, we put forth a concept that we call \emph{incidental correlation}. It is a spurious correlation that emerges in small datasets, despite not being a part of the underlying data generating process (DGP). We develop our argument using a DGP under which multimodal neural networks must utilize all modalities, since all paths between the inputs and target are causal. This represents an idealized scenario that often fails to materialize. Instead, due to incidental correlation, small datasets sampled from this DGP have higher likelihood under an alternative DGP with spurious paths between the inputs and target. Multimodal neural networks that use these spurious paths for prediction fail to utilize all modalities. Given its harmful effects, we propose to detect incidental correlation via latent variable modeling. We specify an identifiable variational autoencoder such that the latent posterior encodes the spurious correlations between the inputs and target. This allows us to interpret the Kullback-Leibler divergence between the latent posterior and prior as the severity of incidental correlation. We use an ablation study to show that identifiability is important in this context, since we derive our conclusions from the latent posterior. 
Using experiments with synthetic data, as well as with VQA v2.0 and NLVR2, we demonstrate that incidental correlation emerges in the small data regime, and leads to modality underutilization. Practitioners of multimodal learning can use our method to detect whether incidental correlation is present in their datasets, and determine whether they should collect additional data.
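To make the detection signal concrete, the following is a minimal sketch of how a Kullback-Leibler divergence between a latent posterior and a mixture prior might be estimated. It is not the paper's implementation. It assumes a diagonal Gaussian posterior and Gaussian mixture components, which this abstract does not state, and every function and parameter name is hypothetical. Because the KL divergence to a mixture has no closed form, the sketch uses a Monte Carlo estimate.

```python
import math
import random


def log_diag_gaussian(z, mean, log_var):
    """Log density of a diagonal Gaussian at z (all arguments are lists of floats)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi) + lv + (zi - m) ** 2 / math.exp(lv))
        for zi, m, lv in zip(z, mean, log_var)
    )


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mc_kl_posterior_vs_mixture_prior(
    q_mean, q_log_var, prior_logits, prior_means, prior_log_vars,
    n_samples=20000, seed=0,
):
    """Monte Carlo estimate of KL(q || p).

    q is a diagonal Gaussian (assumed posterior family); p is a mixture of
    diagonal Gaussians (assumed prior family) with unnormalized log weights
    `prior_logits`. Larger values mean the posterior sits further from the prior.
    """
    rng = random.Random(seed)
    log_norm = logsumexp(prior_logits)
    log_weights = [logit - log_norm for logit in prior_logits]

    total = 0.0
    for _ in range(n_samples):
        # Draw z ~ q via the reparameterization z = mean + std * eps.
        z = [
            m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(q_mean, q_log_var)
        ]
        log_q = log_diag_gaussian(z, q_mean, q_log_var)
        log_p = logsumexp([
            lw + log_diag_gaussian(z, pm, plv)
            for lw, pm, plv in zip(log_weights, prior_means, prior_log_vars)
        ])
        total += log_q - log_p
    return total / n_samples
```

Under these assumptions, a posterior that coincides with the prior gives an estimate of zero, and the estimate grows as the posterior moves away from the prior. That is the sense in which the abstract reads the KL term as a severity score.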
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
Experiments:
* Experiments on another realistic dataset called NLVR2
* Rerunning the unidentifiable VAE experiments after changing the decoder activation function
* VAE experiments on VQA v2.0 and NLVR2 with a 128-dimensional latent variable and 32 components in the mixture prior (this is smaller than the 512 dimensions and 128 components used in the experiments in the main text)
* Toy problem experiments with different data generating parameters
Writing:
* Explaining why there are no edges from z to {x, x'} in our model
* Improved discussion on why we can interpret the KL term as the severity of incidental correlation
* Improved discussion on how the mixture prior is parameterized and learned
* Improved discussion of future directions
* Some minor changes such as fixing typos and adding references
Assigned Action Editor: ~Thang_D_Bui1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 1117