CLIPGraphs: Multimodal Graph Networks to Infer Object-Room Affinities

Published: 07 May 2023, Last Modified: 08 May 2023
Venue: ICRA-23 Workshop on Pretraining4Robotics (Lightning)
Keywords: Common Sense Knowledge, Graph Convolutional Networks, Knowledge Graphs, LLMs, Semantic Priors
TL;DR: This paper introduces CLIPGraphs, a novel method for determining the best room in which to place an object, for the task of embodied scene rearrangement.
Abstract: This work focuses on improving pre-trained feature representations for learning functional and semantic priors for embodied AI tasks. Specifically, we propose a GCN-based training pipeline that fine-tunes CLIP embeddings to effectively estimate object-room affinities. Our approach, CLIPGraphs, efficiently combines human commonsense domain knowledge, multimodal information from language and vision inputs (leveraging the strengths of CLIP), and a graph network to encode these functional and semantic priors. We experimentally demonstrate the effectiveness of our approach on a benchmark dataset of object categories, showing a significant improvement over state-of-the-art baselines. The learned embeddings from our approach can be used as priors in downstream embodied AI tasks such as object navigation and scene rearrangement, demonstrating the broad applicability of our method.
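As a rough illustration of the kind of pipeline the abstract describes, the sketch below builds a small GCN over CLIP node embeddings and scores object-room pairs by cosine similarity. This is a minimal sketch under stated assumptions: the graph construction, layer sizes, and dot-product scoring head are illustrative choices, not the paper's exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    # One graph-convolution step: H' = ReLU(A_hat @ H @ W)
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, h, a_hat):
        return F.relu(a_hat @ self.lin(h))

class AffinityGCN(nn.Module):
    # Refines CLIP node features with two GCN layers, then scores
    # object-room pairs by cosine similarity (hypothetical scoring head).
    def __init__(self, clip_dim=512, hidden_dim=256):
        super().__init__()
        self.gcn1 = GCNLayer(clip_dim, hidden_dim)
        self.gcn2 = GCNLayer(hidden_dim, hidden_dim)

    def forward(self, clip_feats, a_hat, obj_idx, room_idx):
        h = self.gcn1(clip_feats, a_hat)
        h = self.gcn2(h, a_hat)
        h = F.normalize(h, dim=-1)
        return (h[obj_idx] * h[room_idx]).sum(-1)  # one affinity score per pair

def normalized_adjacency(adj):
    # Symmetric normalization: A_hat = D^{-1/2} (A + I) D^{-1/2}
    a = adj + torch.eye(adj.size(0))
    d = a.sum(-1).pow(-0.5)
    return d[:, None] * a * d[None, :]

# Toy usage: 8 nodes (5 objects + 3 rooms) with 512-d features.
feats = torch.randn(8, 512)      # stand-in for frozen CLIP embeddings
adj = torch.zeros(8, 8)
adj[0, 5] = adj[5, 0] = 1.0      # e.g. a "mug"--"kitchen" commonsense edge
model = AffinityGCN()
scores = model(feats, normalized_adjacency(adj),
               obj_idx=torch.tensor([0, 1]), room_idx=torch.tensor([5, 6]))

In practice such a model would be trained so that the pairwise scores match human-annotated object-room affinities, and the refined embeddings could then serve as priors for downstream navigation or rearrangement policies.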