Building Long-term Spatial Temporal Semantic Map

Published: 07 May 2023, Last Modified: 21 May 2023
ICRA-23 Workshop on Pretraining4Robotics (Spotlight)
Keywords: long-horizon, semantic map, spatial-temporal, long-term, assistive robotics
TL;DR: Our work introduces an algorithm for building an open-vocabulary, queryable semantic map over long time periods.
Abstract: Building dynamic 3D semantic maps that scale over days and weeks is central for household robots that operate in unstructured real-world environments and interact with humans over long periods. Such a long-term object map can assist human users by grounding their natural language queries and retrieving the objects' spatial-temporal information. To our knowledge, no integrated approach exists for building a spatial-temporal map that handles days or weeks of diverse robotic sensor data in a partially observable environment, including dynamic objects. Our approach is agnostic to the object recognition algorithm used and does not require the space of user queries to be known in advance. We propose a representation for the long-term spatial-temporal semantic map that enables the robot to answer real-time queries about the unique object instances in an environment. We also present a Detection-based 3-level Hierarchical Association approach (D3A) that builds our long-term spatial and temporal map. Our representation stores a keyframe that best represents each unique object, together with its spatial-temporal information, organized in a key-value database. The representation supports open-vocabulary queries and even handles queries that do not name a specific concept, such as those based only on attributes or spatial-temporal relationships. We discuss the retrieval performance of our system with a parameterized synthetic embedding detector. When D3A is queried for 59 ground-truth objects, the ground-truth object instance is found on average in the 5th returned frame, whereas for the baseline it is found in the 20th frame. We also present preliminary results on a self-collected, 22-hour dataset of a robotics-lab environment, showing that our queryable semantic scene representation occupies only 0.17% of the total sensory data.
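
The abstract describes a key-value database that stores, for each unique object instance, a best-view keyframe and its spatial-temporal information, queried with open-vocabulary embeddings. The sketch below is not the authors' implementation; the `ObjectRecord` fields, the cosine-similarity ranking, and all names are illustrative assumptions about what such a record and query step could look like.

```python
# Minimal sketch of a keyframe-backed key-value store for unique object
# instances and an open-vocabulary query over it. All field names and the
# similarity-based retrieval are assumptions, not the paper's actual schema.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

import numpy as np


@dataclass
class ObjectRecord:
    """Value stored per unique object instance."""
    keyframe_id: str                                  # best-view keyframe for this instance
    embedding: np.ndarray                             # semantic embedding of the keyframe crop
    position: Tuple[float, float, float]              # last known 3D position (m)
    observations: List[float] = field(default_factory=list)  # observation timestamps (s)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def query(db: Dict[str, ObjectRecord], query_embedding: np.ndarray, top_k: int = 5):
    """Rank stored instances by embedding similarity to an open-vocabulary
    query (e.g. text encoded by the same model that produced the stored
    embeddings), returning the top_k matches."""
    scored = [
        (cosine_similarity(rec.embedding, query_embedding), obj_id, rec)
        for obj_id, rec in db.items()
    ]
    scored.sort(reverse=True, key=lambda t: t[0])
    return scored[:top_k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy database of 100 object instances with random embeddings.
    db = {
        f"object_{i}": ObjectRecord(
            keyframe_id=f"frame_{i:05d}",
            embedding=rng.normal(size=512),
            position=(float(i), 0.0, 0.5),
            observations=[10.0 * i],
        )
        for i in range(100)
    }
    # A real query embedding would come from a text encoder; random here.
    for score, obj_id, rec in query(db, rng.normal(size=512)):
        print(f"{obj_id}: score={score:.3f}, keyframe={rec.keyframe_id}, pos={rec.position}")
```

Returning ranked keyframes (rather than a single answer) matches the abstract's evaluation, where retrieval quality is reported as the rank of the frame containing the ground-truth object.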