Keywords: causality, causal machine learning, out-of-distribution generalization, multimodal learning
TL;DR: We apply ideas from causality to address the issue of multimodal learning performing worse than unimodal learning.
Abstract: We adopt a causal perspective to address why multimodal learning often performs worse than unimodal learning. We put forth a structural causal model (SCM) for which multimodal learning is preferable over unimodal learning. In this SCM, which we call the multimodal SCM, a latent variable causes the inputs, and the inputs cause the target. We refer to this latent variable as an input-causing confounder. By conditioning on all inputs, multimodal learning $d$-separates the input-causing confounder and the target, resulting in a causal model that is more robust than the statistical model learned by unimodal learning. We argue that multimodal learning fails in practice because our finite datasets appear to come from an alternative SCM, which we call the spurious SCM. In the spurious SCM, the input-causing confounder and target are conditionally dependent given the inputs. This means that multimodal learning no longer $d$-separates the input-causing confounder and the target, and fails to estimate a causal model. We use a latent variable model to model the input-causing confounder, and test whether its undesirable dependence with the target is present in the data. We then use the same model to remove this dependence and estimate a causal model, which corresponds to the backdoor adjustment. We use synthetic data experiments to validate our claims.