Implicit Coarse-to-Fine 3D Perception for Category-level Object Pose Estimation from Monocular RGB Image
Abstract: Category-level object pose estimation demonstrates robust generalization capabilities that benefit robotics applications. However, exclusive reliance on RGB images without leveraging any 3D information introduces ambiguity in the translation and size of objects, leading to suboptimal performance. In this paper, we propose a framework for category-level pose estimation from a single RGB image in an end-to-end manner, i.e., Feature Auxiliary Perception Network (FAP-Net). To address inaccurate pose estimation caused by the inherent ambiguity of RGB images, we design a coarse-to-fine approach that first harnesses geometry supervision to facilitate coarse 3D feature perception and subsequently refines the features based on pose and size constraints. Experimental results on REAL275 and CAMERA25 demonstrate that FAP-Net achieves significant improvements (14.7% on 10°10cm and 11.4% on IoU50 on the real-scene REAL275 dataset) over the state-of-the-art and real-time inference (42 FPS).
Loading