Abstract: The human cognitive system, which builds representations of its surroundings by exploiting multiple senses, has inspired several applications that mimic this remarkable property. The key to learning rich representations of data collected by multiple, diverse sensors is to design generative models that can ingest multimodal inputs and merge them into a common space. This makes it possible to: i) obtain coherent generation of samples for all modalities; ii) perform cross-sensor generation, using available modalities to generate missing ones; and iii) exploit synergy across modalities to increase reconstruction quality. In this work, we study multimodal variational autoencoders and propose new methods for learning a joint representation that can both improve synergy and enable cross generation of missing sensor data. We evaluate these approaches on well-established datasets as well as on a new dataset that involves multimodal object detection with three modalities. Our results shed light on the role of joint posterior modeling and training objectives, indicating that even simple and efficient heuristics enable both synergy and cross generation properties to coexist.
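To make the notion of joint posterior modeling concrete, the following is a minimal, hypothetical sketch (not the paper's code) of one common way multimodal VAEs fuse per-modality encoders into a joint latent posterior: a precision-weighted product of experts with a standard-normal prior expert. All names, shapes, and the choice of fusion rule are assumptions for illustration only; the abstract does not specify which joint posterior the authors use.

```python
import torch

def product_of_experts(mus, logvars, eps=1e-8):
    """Fuse per-modality Gaussian posteriors N(mu_m, sigma_m^2) into one Gaussian.

    mus, logvars: tensors of shape (num_experts, batch, latent_dim).
    A standard-normal prior expert is included, so a missing modality can be
    handled by simply omitting its expert -- this is what allows cross generation
    from any subset of available sensors.
    """
    prior_mu = torch.zeros_like(mus[:1])
    prior_logvar = torch.zeros_like(logvars[:1])
    mus = torch.cat([prior_mu, mus], dim=0)
    logvars = torch.cat([prior_logvar, logvars], dim=0)

    precision = 1.0 / (logvars.exp() + eps)        # 1 / sigma_m^2 per expert
    joint_var = 1.0 / precision.sum(dim=0)         # product of Gaussians
    joint_mu = (mus * precision).sum(dim=0) * joint_var
    return joint_mu, joint_var.log()

# Usage sketch: encode each available modality, stack the Gaussian statistics,
# fuse them, sample z with the reparameterization trick, then decode every
# modality from the same z (coherent and cross generation).
mu = torch.randn(3, 16, 32)      # 3 modalities, batch of 16, latent dim 32
logvar = torch.zeros(3, 16, 32)
joint_mu, joint_logvar = product_of_experts(mu, logvar)
z = joint_mu + (0.5 * joint_logvar).exp() * torch.randn_like(joint_mu)
```

Alternative fusion rules (e.g., mixtures of experts) trade off synergy and cross-generation differently, which is the kind of design choice the abstract's discussion of joint posterior modeling refers to.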