Abstract
Estimating the 6D pose of arbitrary objects from a single reference image is a critical yet challenging task in robotics, especially considering the long-tail distribution of real-world instances. While category-level and model-based approaches have achieved notable progress, they remain limited in generalizing to unseen objects under one-shot settings. In this work, we propose a novel pipeline for fast and accurate one-shot 6D pose and scale estimation. Leveraging recent advances in single-view 3D generation, we first build high-fidelity textured meshes without requiring known object poses. To resolve scale ambiguity, we introduce a coarse-to-fine alignment module that estimates both object size and initial pose by matching 2D-3D features with depth information. We then generate a diversified set of plausible 3D models using text-guided generative augmentation and render them with Blender to synthesize large-scale, domain-randomized training data for pose estimation. This synthetic data bridges the domain gap and enables robust fine-tuning of pose estimators. Our method achieves state-of-the-art results on several 6D pose benchmarks, and we further validate its effectiveness on a newly collected in-the-wild dataset. Finally, we integrate our system with a dexterous hand, demonstrating its robustness in real-world robotic grasping tasks. All code, data, and models will be released to foster future research.
Figure 1: Teaser.
Overview
Figure 2 illustrates the overall pipeline of our method. Given an anchor RGB-D image IA containing an object of interest, our primary challenge is to estimate its 6D pose without a pre-existing 3D model, a common limitation for novel objects. To address this, as shown in the top-left of Figure 2, we first leverage recent advancements in single-view 3D generation to create a textured 3D model with a standardized orientation and scale (see Section 3.3).

However, this generated model exists in a normalized space and lacks real-world scale. To recover the object's true size and location in the anchor image frame, we introduce a coarse-to-fine alignment module (see Section 3.4). This module aligns the normalized generated model with the partial object observation in IA, simultaneously estimating the object's metric scale and initial 6D pose. Once the metric-scale model in the anchor view is established, we can efficiently estimate the object's pose in subsequent query RGB-D images IQ (top-right of Figure 2) using the aligned model and a robust pose estimation framework, including a pose selection module to handle potential object symmetries. The final relative transformation TA→Q is then computed from the absolute poses in both views.

Furthermore, recognizing the domain gap between synthetically generated models and real-world images, as depicted in the lower section of Figure 2, we propose a text-guided generative augmentation strategy (see Section 3.5) to create a diversified set of plausible 3D models. These diversified models are then used to synthesize a large-scale, domain-randomized training dataset, enabling robust fine-tuning of the pose estimation components and bridging the sim-to-real gap, as demonstrated in our experimental results (see Section 4).
Figure 2: Overview of One-2-3-Pose.
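To make the scale-recovery step concrete, the snippet below sketches one standard way to fit a metric scale and an initial pose to depth-backed 2D-3D matches, and to form the relative transformation TA→Q from the two absolute poses. This is a minimal sketch, not our exact coarse-to-fine algorithm: the closed-form Umeyama similarity fit and all function names (backproject, fit_similarity, relative_pose) are illustrative assumptions.

```python
import numpy as np

def backproject(uv, depth, K):
    """Lift matched 2D pixels to metric 3D points in the camera frame.

    uv: (N, 2) pixel coordinates, depth: (N,) depths in metres, K: 3x3 intrinsics.
    """
    homo = np.hstack([uv, np.ones((len(uv), 1))])   # homogeneous pixels (N, 3)
    rays = homo @ np.linalg.inv(K).T                # unit-depth rays (N, 3)
    return rays * depth[:, None]                    # scale each ray by its depth

def fit_similarity(model_pts, cam_pts):
    """Umeyama closed-form fit of (s, R, t) with cam_pts ~= s * R @ model_pts + t."""
    n = len(model_pts)
    mu_m, mu_c = model_pts.mean(0), cam_pts.mean(0)
    Xm, Xc = model_pts - mu_m, cam_pts - mu_c
    cov = Xc.T @ Xm / n                             # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:    # guard against reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_m = (Xm ** 2).sum() / n                     # variance of normalized model points
    s = np.trace(np.diag(D) @ S) / var_m            # metric scale of the generated model
    t = mu_c - s * R @ mu_m
    return s, R, t

def relative_pose(T_anchor, T_query):
    """T_A->Q from the absolute 4x4 camera-from-object poses in both views."""
    return T_query @ np.linalg.inv(T_anchor)
```

The fitted (s, R, t) simultaneously gives the metric scale of the normalized model and a coarse anchor-view pose, which a refinement stage would then polish; relative_pose yields TA→Q once the query-view pose is available.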
Full Video
Experiments
Public datasets. We evaluated our method on three challenging public datasets: YCBInEOAT (robotic interaction), Toyota-Light (TOYL; challenging lighting), and LINEMOD Occlusion (LM-O; cluttered scenes with occluded, textureless objects).
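These benchmarks are commonly scored with the ADD and ADD-S pose-error metrics, with a pose typically counted correct below 10% of the object diameter. The sketch below is our assumption of the standard formulation, not necessarily the exact protocol used here:

```python
import numpy as np
from scipy.spatial import cKDTree

def add_error(model_pts, R_gt, t_gt, R_est, t_est):
    """ADD: mean distance between model points under GT and estimated poses."""
    gt = model_pts @ R_gt.T + t_gt
    est = model_pts @ R_est.T + t_est
    return np.linalg.norm(gt - est, axis=1).mean()

def adds_error(model_pts, R_gt, t_gt, R_est, t_est):
    """ADD-S: nearest-neighbour variant used for symmetric objects."""
    gt = model_pts @ R_gt.T + t_gt
    est = model_pts @ R_est.T + t_est
    return cKDTree(est).query(gt, k=1)[0].mean()
```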
Real-world evaluation. We performed two experiments in real-world settings: (1) 6D pose estimation for uncommon objects, generating synthetic training data via our domain-randomization pipeline and testing on a calibrated real-world set; and (2) robotic manipulation, establishing grasping setups with a ROKAE robot arm equipped with an XHAND1 dexterous hand and with two AgileX PiPER arms, and measuring success rates against baselines.
[Results gallery: fourteen example sequences, each shown as an anchor image, the aligned model, and the original, rendered, and 6D pose videos.]
Figure 3: Qualitative comparison on the YCBInEOAT dataset.