Abstract: In recent years, deep learning models have achieved remarkable success in computer vision tasks, but their ability to process and reason about multi-modal data has remained limited. The emergence of models that leverage a contrastive loss to learn a joint embedding space for images and text has sparked research in unsupervised multi-modal alignment. This paper proposes a contrastive model for the multi-modal alignment of images and 3D representations. In particular, we study the alignment of images and raw point clouds in a learned latent space. The effectiveness of the proposed model is demonstrated through a range of experiments, including 3D shape retrieval from a single image, evaluation on out-of-distribution data, and latent space analysis.
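The contrastive objective described above can be illustrated as a symmetric InfoNCE loss over paired embeddings, in the style of CLIP. The sketch below is a minimal PyTorch illustration under assumed conventions, not the paper's exact implementation: the encoder outputs, temperature value, and function name are all assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, pc_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image / point-cloud
    embeddings. `img_emb` and `pc_emb` are (B, D) tensors produced by
    hypothetical image and point-cloud encoders (names assumed here)."""
    # L2-normalize so dot products become cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    pc_emb = F.normalize(pc_emb, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the matched pairs.
    logits = img_emb @ pc_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions: image -> point cloud and vice versa.
    loss_i2p = F.cross_entropy(logits, targets)
    loss_p2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2p + loss_p2i)
```

Minimizing this loss pulls each image embedding toward the embedding of its paired point cloud while pushing it away from the other point clouds in the batch, which is what yields the shared latent space used for tasks such as single-image 3D shape retrieval.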