Keywords: multimodal representation learning, probabilistic representation, image--caption retrieval
Abstract: Learning multimodal representations is a requirement for many tasks such as image--caption retrieval. Previous work on this problem has focused on finding good vector representations without any explicit measure of uncertainty. In this work, we argue and demonstrate that learning multimodal representations as probability distributions can lead to better representations and provide additional benefits, such as a measure of uncertainty over the learned representations. We show that this measure of uncertainty can capture how confident the model is about a representation in the multimodal domain, i.e., how easy it is for the model to retrieve or predict the matching pair. We also experiment with similarity metrics that have not traditionally been used for multimodal retrieval, and show that the choice of similarity metric affects the quality of the learned representations.
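A minimal sketch of the idea described in the abstract, not the paper's exact method: each modality encoder outputs a diagonal Gaussian (mean and log-variance) instead of a point vector, a distribution-level similarity scores image--caption pairs, and the total variance serves as the uncertainty measure. The module names, the 2-Wasserstein distance, and the variance-based uncertainty score are illustrative assumptions, since the paper's precise parameterization and metric are not given here.
```python
# Sketch only: probabilistic (Gaussian) embeddings for multimodal retrieval.
import torch
import torch.nn as nn

class ProbabilisticHead(nn.Module):
    """Maps backbone features to a diagonal Gaussian embedding (assumed design)."""
    def __init__(self, in_dim: int, embed_dim: int):
        super().__init__()
        self.mu = nn.Linear(in_dim, embed_dim)
        self.logvar = nn.Linear(in_dim, embed_dim)

    def forward(self, x):
        # Returns the mean and log-variance of the embedding distribution.
        return self.mu(x), self.logvar(x)

def wasserstein2_sq(mu1, logvar1, mu2, logvar2):
    """Squared 2-Wasserstein distance between diagonal Gaussians.
    One possible distribution-level similarity; the paper may use others."""
    sigma1 = torch.exp(0.5 * logvar1)
    sigma2 = torch.exp(0.5 * logvar2)
    return ((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2).sum(dim=-1)

def uncertainty(logvar):
    """Total variance as a simple per-sample uncertainty score (assumption)."""
    return torch.exp(logvar).sum(dim=-1)
```
In such a setup, retrieval would rank candidates by the (negative) distribution distance, while the uncertainty score indicates how confident the model is that a clear matching pair exists.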
One-sentence Summary: Proposing probabilistic representations for multimodal scenarios such as image--caption retrieval
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=ZFnXz1Sqx9