Keywords: multimodal learning, selection bias
TL;DR: We formalize a framework for supervised multimodal learning and identify a clear condition under which multimodal modeling outperforms multidimensional modeling.
Abstract: Supervised multimodal learning is the problem of learning to map a set of separate modalities to a target. Despite this intuitive definition, it is unclear whether one should model the problem with a multidimensional model, in which the features from all modalities are concatenated and treated as multidimensional features from a single modality, or with a multimodal model, which uses the information about modality boundaries. In this work, we formalize a framework for supervised multimodal learning and identify the conditions that favor multimodal modeling over multidimensional modeling: it is advantageous when dependencies across or within modalities shift at test time.
Through a series of synthetic experiments, in which we fully control the data-generation process, we verify the necessity of multimodal modeling for solving a supervised multimodal learning problem. Our proposed framework is agnostic to model architecture and can have widespread impact by informing modeling choices when dealing with data from different modalities.
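To make the setting concrete, here is a minimal sketch of the kind of synthetic setup the abstract describes; it is a hypothetical illustration, not the paper's actual experiments. The `sample` function, the dependency parameter `rho`, and the noise structure are all assumptions for this example: two modalities share part of their noise at train time (`rho=0.9`), the shared dependency disappears at test time (`rho=0.0`), and a multidimensional (concatenation) model fitted by least squares carries weights tuned to the training-time dependency.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, rho):
    """Two one-dimensional modalities whose cross-modal dependency is set by rho."""
    y = rng.standard_normal(n)          # target
    e1 = rng.standard_normal(n)
    e2 = rng.standard_normal(n)
    x1 = y + 0.5 * e1                   # modality 1: noisy view of y
    x2 = y + 0.5 * (rho * e1 + e2)      # modality 2: noise partly shared with x1
    return np.stack([x1, x2], axis=1), y

# Cross-modal dependency present at train time, absent at test time.
X_tr, y_tr = sample(10_000, rho=0.9)
X_te, y_te = sample(10_000, rho=0.0)

# Multidimensional model: least squares on the concatenated features,
# ignoring the modality boundary between x1 and x2.
w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
print(f"train MSE: {np.mean((X_tr @ w - y_tr) ** 2):.3f}")
print(f"test MSE after dependency shift: {np.mean((X_te @ w - y_te) ** 2):.3f}")
```

In this toy setup, the weights learned under the correlated-noise regime are no longer optimal once the cross-modal dependency shifts; how much the concatenation model degrades, and when keeping the modality boundary helps, is exactly what the paper's framework characterizes.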
Submission Number: 75