Multimodal Language Learning for Object Retrieval in Low Data Regimes in the Face of Missing Modalities

Published: 27 Oct 2023, Last Modified: 27 Oct 2023. Accepted by TMLR.
Abstract: Our study is motivated by robotics, where we often must balance reliance on complex, multimodal data from a variety of sensors against a general lack of large representative datasets. Despite the complexity of modern robotic platforms and the need for multimodal interaction, there has been little research on integrating more than two modalities in a low data regime under the real-world constraint that sensors fail due to obstructions or adverse conditions. In this work, we consider a setting in which natural language is used as a retrieval query against objects, represented across multiple modalities, in a physical environment. We introduce extended multimodal alignment (EMMA), a method that learns to select the appropriate object while jointly refining modality-specific embeddings through a geometric (distance-based) loss. In contrast to prior work, our approach can incorporate an arbitrary number of views (modalities) of a particular piece of data. We demonstrate the efficacy of our model on a grounded language object retrieval scenario and show that it outperforms state-of-the-art baselines when little training data is available. Our code is available at
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url:
Changes Since Last Submission: For the camera-ready version, we have addressed the points raised by the reviewers and action editor, including added discussion of related work, clarifications, and improved presentation of results. For the initial resubmission, based on the insightful comments from the reviewers and action editor, we made the following changes, which include new results and analyses. Our most significant change is a new experiment in which we train all models with low amounts of training data and test them on the same test set. We show that our model, EMMA, outperforms the baselines by a sizable margin in MRR. More specifically, when all models are trained with 25 percent of the training data, EMMA (ours) achieves an MRR of 73.07$\pm$0.39 when the text modality is ablated at test time, while SupCon (baseline) achieves an MRR of 46.39$\pm$0.33. When all modalities are available, EMMA achieves 90.72$\pm$0.34 and still outperforms SupCon, which reaches an MRR of 83.25$\pm$0.68. We report the complete set of results in table 1. The details of this experiment are provided in section 5.3 (Learning from Limited Data), and figures 2 and 3 show the MRR score as a function of the amount of training data. This new experiment addresses the shared comment from the reviewers and action editor about the significance of the results. The next major change is a deeper analysis of the components of our loss function in section 5.6 (EMMA Component Contributions). We performed an experiment in which we vary the weights of the two components of EMMA (Geom and SupCon), constrained to sum to one, and measure the effect on both MRR and the distance between dissimilar items in the latent space. As shown in figure 5, we observe a positive correlation between increasing the weight of Geom and improvements in the MRR score.
Moreover, increasing the weight of Geom correlates with more dissimilar items being mapped outside an enforced margin from each other, ensuring a minimum distance between dissimilar representations. This experiment demonstrates the effectiveness of our proposed Geometric Alignment method as well as the contribution of each of EMMA's components. We hope these changes address the reviewers' comments. In addition, we did a thorough pass over the paper: we reorganized it, defined concepts before using them, fixed the typos mentioned, incorporated the requested clarifications, and reduced the length substantially to make it more readable. More specifically, we removed almost three pages, which changes the paper from a long submission to a regular submission.
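The weighting scheme discussed above (a geometric, margin-based term and a SupCon term whose weights sum to one) can be sketched as follows. This is an illustrative reconstruction, not the paper's exact formulation: the function names, the squared-distance form of the geometric term, and the default margin are all assumptions made for the sake of a minimal example.

```python
import numpy as np

def geometric_margin_loss(a, b, similar, margin=1.0):
    """Illustrative margin-based geometric loss (assumed form, not the
    paper's exact Geom term): pull embeddings of similar items together,
    push dissimilar items at least `margin` apart."""
    d = float(np.linalg.norm(a - b))
    if similar:
        return d ** 2          # similar pairs: penalize any distance
    return max(0.0, margin - d) ** 2  # dissimilar pairs: penalize only inside the margin

def combined_loss(geom_term, supcon_term, w_geom):
    """Convex combination of the two components, with weights summing
    to one as in the ablation described above."""
    assert 0.0 <= w_geom <= 1.0
    return w_geom * geom_term + (1.0 - w_geom) * supcon_term

# Example: a dissimilar pair already outside the margin contributes no
# geometric loss, so the combined loss reduces to the SupCon share.
a, b = np.zeros(3), np.array([2.0, 0.0, 0.0])
geom = geometric_margin_loss(a, b, similar=False, margin=1.0)
total = combined_loss(geom, supcon_term=4.0, w_geom=0.75)
```

Sweeping `w_geom` from 0 to 1, as in the component-contribution experiment, then amounts to re-evaluating `combined_loss` with fixed per-pair terms and different weights.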
Assigned Action Editor: ~Yonatan_Bisk1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 1437