CLIPGraphs: Multimodal Graph Networks to Infer Object-Room Affinities

Published: 07 May 2023, Last Modified: 08 May 2023
Venue: ICRA-23 Workshop on Pretraining4Robotics (Lightning)
Keywords: Common Sense Knowledge, Graph Convolutional Networks, Knowledge Graphs, LLMs, Semantic Priors
TL;DR: This paper introduces CLIPGraphs, a novel method for determining the best room in which to place an object, for the task of embodied scene rearrangement.
Abstract: This work focuses on improving pre-trained feature representations for learning functional and semantic priors for embodied AI tasks. Specifically, we propose a GCN-based training pipeline that fine-tunes CLIP embeddings to effectively estimate object-room affinities. Our approach, CLIPGraphs, efficiently combines human commonsense domain knowledge, multimodal information from language and vision inputs (leveraging the strengths of CLIP), and a graph network to encode these functional and semantic priors. We experimentally demonstrate the effectiveness of our approach on a benchmark dataset of object categories, showing a significant improvement over state-of-the-art baselines. The learned embeddings from our approach can be used as priors in downstream embodied AI tasks such as object navigation and scene rearrangement, demonstrating the broad applicability of our method.
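As a rough illustration of the kind of pipeline the abstract describes, the sketch below builds a small GCN over CLIP node embeddings and scores object-room pairs by cosine similarity. This is a minimal sketch under stated assumptions: the graph construction, layer sizes, and dot-product scoring head are illustrative choices, not the paper's exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    # One graph-convolution step: H' = ReLU(A_hat @ H @ W)
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, h, a_hat):
        return F.relu(a_hat @ self.lin(h))

class AffinityGCN(nn.Module):
    # Refines CLIP node features with two GCN layers, then scores
    # object-room pairs by cosine similarity (hypothetical scoring head).
    def __init__(self, clip_dim=512, hidden_dim=256):
        super().__init__()
        self.gcn1 = GCNLayer(clip_dim, hidden_dim)
        self.gcn2 = GCNLayer(hidden_dim, hidden_dim)

    def forward(self, clip_feats, a_hat, obj_idx, room_idx):
        h = self.gcn1(clip_feats, a_hat)
        h = self.gcn2(h, a_hat)
        h = F.normalize(h, dim=-1)
        return (h[obj_idx] * h[room_idx]).sum(-1)  # one affinity score per pair

def normalized_adjacency(adj):
    # Symmetric normalization: A_hat = D^{-1/2} (A + I) D^{-1/2}
    a = adj + torch.eye(adj.size(0))
    d = a.sum(-1).pow(-0.5)
    return d[:, None] * a * d[None, :]

# Toy usage: 8 nodes (5 objects + 3 rooms) with 512-d features.
feats = torch.randn(8, 512)      # stand-in for frozen CLIP embeddings
adj = torch.zeros(8, 8)
adj[0, 5] = adj[5, 0] = 1.0      # e.g. a "mug"--"kitchen" commonsense edge
model = AffinityGCN()
scores = model(feats, normalized_adjacency(adj),
               obj_idx=torch.tensor([0, 1]), room_idx=torch.tensor([5, 6]))

In practice such a model would be trained so that the pairwise scores match human-annotated object-room affinities, and the refined embeddings could then serve as priors for downstream navigation or rearrangement policies.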