Refining Multimodal Representations using a modality-centric self-supervised module

29 Sept 2021 (modified: 13 Feb 2023) · ICLR 2022 Conference Withdrawn Submission
Keywords: Multimodal modeling, self-supervision, metric learning
Abstract: Tasks that rely on multimodal information typically include a fusion module that combines information from different modalities. In this work, we develop a self-supervised module, called REFINER, that refines multimodal representations using a decoding/defusing module applied downstream of the fused embedding. REFINER imposes a modality-centric responsibility condition, ensuring that both the unimodal and the fused representations are strongly encoded in the latent fusion space. Our approach yields stronger generalization and reduced overfitting. REFINER is applied only at training time, leaving inference time unchanged, and its modular nature allows it to be combined easily with different fusion architectures. We demonstrate the power of REFINER on three datasets over strong baseline fusion modules, and further show that it gives a significant performance boost on few-shot learning tasks.
One-sentence Summary: A self-supervised REFINER module is introduced to boost performance of Multimodal Fusion Networks.
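The abstract describes REFINER's mechanism only at a high level: a decoder ("defuser") sits downstream of the fused embedding and must reconstruct each modality's representation, and this auxiliary loss is used only during training. The sketch below illustrates that idea with toy linear maps in NumPy; all dimensions, names (`fuse`, `defuse`, `refiner_loss`), and the concrete reconstruction loss are illustrative assumptions, not the paper's actual architecture or objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding sizes (not taken from the paper).
D_A, D_B, D_F = 4, 4, 6

# Toy linear "fusion" and "defusing" maps; a real model would learn these
# jointly with the task loss.
W_fuse = rng.normal(size=(D_A + D_B, D_F)) * 0.1
W_defuse = rng.normal(size=(D_F, D_A + D_B)) * 0.1

def fuse(z_a, z_b):
    """Combine two unimodal embeddings into one fused representation."""
    return np.concatenate([z_a, z_b]) @ W_fuse

def defuse(z_fused):
    """Decode the fused embedding back toward the unimodal embeddings."""
    out = z_fused @ W_defuse
    return out[:D_A], out[D_A:]

def refiner_loss(z_a, z_b):
    """Auxiliary self-supervised loss: each modality's embedding must
    remain recoverable from the fused representation (one plausible
    reading of the 'responsibility condition')."""
    z_f = fuse(z_a, z_b)
    r_a, r_b = defuse(z_f)
    return float(np.mean((r_a - z_a) ** 2) + np.mean((r_b - z_b) ** 2))

# During training this loss would be added to the task loss; at inference
# the defusing branch is dropped, so inference cost is unchanged.
z_a = rng.normal(size=D_A)
z_b = rng.normal(size=D_B)
loss = refiner_loss(z_a, z_b)
```

The training-only nature of the module follows directly from the structure: the `defuse` branch contributes only a loss term, so removing it after training leaves the fusion network's forward pass intact.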