RASP: Robot Active Scene Perception With Joint Viewpoint Planning and Depth Completion in Cluttered Environments

Yizhe Liu, Tong Jia, Haiyu Zhang, Guowei Yang, Hao Wang, Dongyue Chen

Published: 01 Jan 2025, Last Modified: 04 Nov 2025. IEEE Transactions on Automation Science and Engineering. License: CC BY-SA 4.0
Abstract: Capturing dense visual perception of cluttered environments with depth sensors is crucial for downstream robotic tasks. However, occlusions between objects and unreliable depth data make it challenging for robots to perceive comprehensive and accurate scene information. To address these issues, we propose RASP, a novel robot active scene perception method composed of two parts. First, we introduce a temporal attention-based view planning algorithm that actively plans the minimum feasible viewpoint sequence, exploiting latent dependencies across all previous observations to maximize the information perceived from the cluttered environment. Second, to handle the unreliable depth data produced by the depth sensor, particularly on transparent and specular objects, we design a geometry-guided depth completion network that fully exploits the 3D scene information accumulated during the progressive perception process. Specifically, multi-level scene geometric features are extracted, projected into image space, and combined with image features to guide the depth completion step. The two parts are learned jointly to achieve consistent and accurate results. Extensive experiments demonstrate that our method outperforms state-of-the-art methods, and we further show that it significantly improves the performance of downstream grasping tasks.

Note to Practitioners—With the development of depth sensing technology, consumer-grade depth cameras are widely used in robotics and automation tasks. In cluttered environments, however, occlusions between objects make it difficult for a depth camera to capture comprehensive scene information from a single viewpoint, and depth sensors often produce unreliable measurements on transparent and specular objects. To address these challenges, this article proposes a robot active scene perception method that combines view planning with depth completion. The view planning strategy actively explores the minimum feasible viewpoint sequence to maximize the scene information acquired, while the depth completion network restores the unreliable depth measurements. Experimental results show that the method performs well in cluttered environments containing transparent and specular objects, providing high-quality perceptual input for robots performing intelligent tasks in practice.
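To make the view planning idea concrete, below is a minimal PyTorch sketch of a temporal-attention viewpoint scorer in which candidate next views attend to embeddings of all previous observations. The module names, tensor shapes, and greedy stopping rule are hypothetical illustrations of the idea described in the abstract, not the authors' implementation.

```python
# Minimal, hypothetical sketch of a temporal-attention viewpoint scorer.
# All names, shapes, and the stopping heuristic are assumptions made for
# illustration; they do not come from the RASP paper's code.
import torch
import torch.nn as nn

class TemporalViewScorer(nn.Module):
    def __init__(self, feat_dim: int = 256, n_heads: int = 4):
        super().__init__()
        # Cross-attention: candidate viewpoints (queries) attend to the
        # history of previous observations (keys/values), modeling the
        # latent dependencies across views mentioned in the abstract.
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.score_head = nn.Linear(feat_dim, 1)

    def forward(self, history: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
        # history:    (B, T, D) embeddings of the T views observed so far
        # candidates: (B, K, D) embeddings of K candidate next viewpoints
        fused, _ = self.attn(query=candidates, key=history, value=history)
        return self.score_head(fused).squeeze(-1)  # (B, K) predicted gain

# Greedy planning loop: observe the best-scoring candidate, append it to
# the history, and stop once the predicted gain falls below a threshold;
# this is one plausible reading of "minimum feasible viewpoint sequence".
scorer = TemporalViewScorer()
history = torch.randn(1, 3, 256)     # three views captured already
candidates = torch.randn(1, 8, 256)  # eight reachable next viewpoints
best = scorer(history, candidates).argmax(dim=-1)
```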
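The geometry-guided depth completion step can likewise be sketched: 3D scene features are projected into the image plane with the camera intrinsics and fused with 2D image features before decoding a completed depth map. The projection, scatter strategy, and layer choices below are illustrative assumptions, not the network described in the paper.

```python
# Hypothetical sketch of geometry-guided fusion for depth completion.
# Shapes, the nearest-pixel splatting, and the tiny decoder are
# assumptions chosen only to illustrate projecting 3D scene features
# into image space and combining them with image features.
import torch
import torch.nn as nn

def project_points(points: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    # points: (N, 3) in the camera frame; K: (3, 3) pinhole intrinsics.
    uvw = (K @ points.T).T                            # (N, 3) homogeneous
    return uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)   # (N, 2) pixel coords

class GeometryGuidedFusion(nn.Module):
    def __init__(self, img_ch: int = 64, geo_ch: int = 64):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(img_ch + geo_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 3, padding=1),  # completed depth map
        )

    def forward(self, img_feat, points, point_feat, K):
        # img_feat: (1, C_img, H, W); points: (N, 3); point_feat: (N, C_geo)
        _, _, H, W = img_feat.shape
        uv = project_points(points, K).round().long()
        valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & \
                (uv[:, 1] >= 0) & (uv[:, 1] < H)
        uv, feat = uv[valid], point_feat[valid]
        geo = img_feat.new_zeros(1, feat.shape[1], H, W)
        geo[0, :, uv[:, 1], uv[:, 0]] = feat.T  # splat 3D features into 2D
        return self.decode(torch.cat([img_feat, geo], dim=1))
```

In this reading, the splatted geometric features carry scene structure recovered from earlier viewpoints, which is what lets the decoder recover depth on transparent and specular surfaces where the raw sensor reading is unreliable.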