Keywords: Object Referral, Pointing Gesture, 3D Segmentation
TL;DR: We propose a new benchmark for the evaluation of pointing-gesture-based object referral in 3D point clouds.
Abstract: Pointing gestures provide a natural and efficient way to communicate spatial information in human-machine interaction, yet their potential for 3D object referral remains largely under-explored. To fill this gap, we introduce the task of pointing-based 3D segmentation: given an image of a person pointing at an object and the 3D point cloud of the environment, the goal is to predict the 3D segmentation mask of the referred object. To enable standardized evaluation of this task, we introduce POINTR3D, a curated dataset of over 65,000 frames captured with three cameras across four indoor scenes, featuring diverse pointing scenarios. Each frame is annotated with the active hand, the corresponding object ID, and the 3D segmentation mask of the referred object. To demonstrate the use of the proposed dataset, we further introduce Pointing3D, a transformer-based architecture that predicts the pointing direction from RGB images and uses this prediction as a prompt to segment the referred object in the 3D point cloud. Experimental results show that Pointing3D outperforms the strong baselines we introduce and lays the groundwork for future research. The dataset, source code, and evaluation tools will be made publicly available to support further research in this area, enabling natural human-machine interaction.
Supplementary Material: zip
Spotlight: mp4
Submission Number: 722