Unifying Scene Representation and Hand-Eye Calibration with 3D Foundation Models

Published: 24 Apr 2024, Last Modified: 24 Apr 2024 · ICRA 2024 Workshop on 3D Visual Representations for Robot Manipulation · CC BY 4.0
Keywords: Scene Representation, Hand-eye Calibration, Foundation Models
Abstract: Representing the environment is a central problem for robots and a prerequisite for downstream decision-making and motion planning. In the past, constructing a representation from a manipulator-mounted camera required carefully calibrating the camera in advance against a specific external calibration marker, such as a checkerboard. Recent advances in computer vision have given rise to \emph{3D foundation models}: large pre-trained neural network models capable of fast and accurate multi-view correspondence from very few images, even in the absence of rich visual features. This paper advocates for integrating 3D foundation models into scene representation approaches for robot systems with a manipulator-mounted RGB camera. In particular, we propose the Joint Calibration and Representation (JCR) method. JCR leverages RGB images captured by the manipulator-mounted camera to simultaneously construct a representation of the environment and calibrate the camera. We demonstrate the ability of JCR to build accurate scene representations with a low-cost RGB camera attached to a manipulator, without additional calibration.
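The abstract gives no implementation details, but the coupling it describes (per-image camera poses estimated by a 3D foundation model, combined with end-effector poses from the robot's forward kinematics) reduces to a classical AX = XB hand-eye problem. The sketch below is not the authors' code: the names `ee_poses`, `cam_poses`, and `hand_eye_from_poses` are hypothetical, and the metric-scale caveat in the docstring is an assumption about what a full method like JCR must additionally resolve, since foundation-model reconstructions are typically defined only up to scale.

```python
import numpy as np
import cv2

# Hypothetical inputs (illustrative, not from the paper):
#   ee_poses:  list of 4x4 end-effector-to-base transforms from forward
#              kinematics, one per captured RGB image.
#   cam_poses: list of 4x4 camera-to-world transforms estimated by a 3D
#              foundation model from those same images, up to unknown scale.

def hand_eye_from_poses(ee_poses, cam_poses):
    """Solve the classic AX = XB hand-eye problem with OpenCV.

    Returns the 4x4 camera-to-end-effector transform. The translation is
    only meaningful once the foundation model's reconstruction has been
    rescaled to metric units, which this sketch does not attempt.
    """
    R_gripper2base = [T[:3, :3] for T in ee_poses]
    t_gripper2base = [T[:3, 3] for T in ee_poses]
    # cv2.calibrateHandEye expects target-to-camera transforms; inverting
    # the camera-to-world poses plays that role here.
    R_target2cam, t_target2cam = [], []
    for T in cam_poses:
        T_inv = np.linalg.inv(T)
        R_target2cam.append(T_inv[:3, :3])
        t_target2cam.append(T_inv[:3, 3])
    R_cam2ee, t_cam2ee = cv2.calibrateHandEye(
        R_gripper2base, t_gripper2base,
        R_target2cam, t_target2cam,
        method=cv2.CALIB_HAND_EYE_TSAI,
    )
    X = np.eye(4)
    X[:3, :3], X[:3, 3] = R_cam2ee, t_cam2ee.ravel()
    return X
```

In a marker-based pipeline the target-to-camera poses would come from a checkerboard detector; the point of the abstract is that a foundation model's multi-view correspondence can supply equivalent pose constraints from the scene itself, removing the need for the marker.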
Submission Number: 11