Learning Cross-View Consistent 3D Keypoints for Object 6D Pose Estimation

Shaobo Zhang, Wanqing Zhao, Ziyu Guan, Wei Zhao, Jinye Peng, Jianping Fan

Published: 2025, Last Modified: 21 Jan 2026. IEEE Trans. Circuits Syst. Video Technol. 2025. License: CC BY-SA 4.0

Abstract: Accurate 6D object pose estimation from RGB images is crucial for various computer vision applications, such as augmented reality, robotic manipulation and autonomous driving. Existing methods often rely on extensive labeled data, either manually annotated or synthetically generated, which can be laborious and impractical for real-world deployment. To address these challenges, we propose OK-POSE, a keypoint-based 6D object pose estimation method that leverages relative transformations between viewpoints for training. By utilizing pairs of images with object annotations and relative transformation information, OK-POSE automatically learns to detect 3D keypoints of objects, enabling geometrically and visually consistent pose estimation. The simplicity and accessibility of obtaining relative transformation information, which can be acquired from inexpensive binocular cameras or common smartphone devices, significantly reduce labeling costs and mitigate domain gap issues associated with synthetic data. Experimental results demonstrate that OK-POSE achieves competitive performance compared to methods relying on explicit 3D annotations or object 3D models. Moreover, we provide insights into the data collection process and introduce OK-POSE++, an enhanced version with optimized network architecture and loss functions, yielding further improvements in performance. Our approach offers a practical solution for 6D object pose estimation, suitable for real-world applications in scenarios where extensive 3D annotations or object models are unavailable. The code is released at https://github.com/acmff22/OKPOSE.
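To make the training signal described in the abstract concrete, the following is a minimal sketch of a cross-view consistency check on 3D keypoints. It is not taken from the OK-POSE code. It assumes that keypoints are predicted in each camera's coordinate frame and that the relative transformation from view A to view B is given as a rotation matrix and translation vector. The function names `transform_points` and `cross_view_consistency`, and the choice of mean Euclidean distance as the error, are illustrative assumptions rather than the paper's actual loss.

```python
import math

def transform_points(points, rotation, translation):
    """Apply X' = R X + t to each 3D point (plain nested lists)."""
    return [
        [
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        ]
        for p in points
    ]

def cross_view_consistency(kps_a, kps_b, rotation_ab, translation_ab):
    """Mean Euclidean distance between view-A keypoints mapped into
    view B's camera frame and the keypoints predicted in view B.
    Hypothetical error measure, assumed for illustration only."""
    mapped = transform_points(kps_a, rotation_ab, translation_ab)
    dists = [math.dist(m, b) for m, b in zip(mapped, kps_b)]
    return sum(dists) / len(dists)

# Toy data: a 90 degree rotation about the z axis plus a small shift.
R_ab = [[0.0, -1.0, 0.0],
        [1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0]]
t_ab = [0.1, 0.0, 0.5]

kps_a = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0]]   # keypoints in view A
kps_b = transform_points(kps_a, R_ab, t_ab)  # perfectly consistent view B

loss = cross_view_consistency(kps_a, kps_b, R_ab, t_ab)

# Perturb one view-B keypoint to show that inconsistency raises the error.
kps_b_noisy = [[x + 0.3 for x in kps_b[0]], kps_b[1]]
noisy_loss = cross_view_consistency(kps_a, kps_b_noisy, R_ab, t_ab)
```

In this sketch the error is zero when the two sets of keypoints agree under the known relative transformation and grows as they diverge, which is the kind of geometric constraint that can supervise keypoint detection without explicit 3D annotations.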

External IDs: dblp:journals/tcsv/ZhangZGZPF25