AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment

20 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: 6D pose estimation, multi-view, pose refinement, foundation models
TL;DR: This work presents a multi-view 6D object pose estimation method that generalizes to unseen objects through a novel multi-view optimization approach based on DINOv2 image features.
Abstract: Single-view RGB model-based object pose estimation methods achieve strong generalization performance but are fundamentally limited by depth ambiguity, clutter, and occlusions. Multi-view pose estimation methods have the potential to solve these issues, but existing works rely on precise single-view pose estimates or lack generalization to unseen objects. To address these challenges, we introduce AlignPose, a 6D object pose estimation method that aggregates information from multiple extrinsically calibrated views and generalizes to unseen objects. The contributions of this work are threefold. First, leveraging powerful, frozen features from a foundation model, AlignPose iteratively minimizes the discrepancy between rendered and observed images across multiple viewpoints, enforcing geometric consistency without object-specific training. Second, robust handling of noisy inputs is achieved by aggregating pose candidates from an arbitrary single-view pose estimator via 3D non-maximum suppression. Third, extensive experiments on three BOP benchmarks (YCB-V, T-LESS, ITODD-MV) show AlignPose sets a new state of the art, especially on challenging industrial datasets where multiple views are readily available in practice.
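The first contribution describes gradient-based refinement that minimizes the discrepancy between rendered and observed feature maps across all calibrated views. The sketch below illustrates the general idea only; `render_features` is a hypothetical differentiable renderer producing feature maps for a given pose and camera (the paper's actual feature extractor is a frozen DINOv2, and its optimizer and pose parameterization are not specified here).

```python
import torch

def refine_pose(pose6d, views, render_features, steps=100, lr=1e-2):
    """Illustrative multi-view feature-metric refinement (not the paper's exact method).

    pose6d:          (6,) initial pose parameters (e.g. axis-angle + translation)
    views:           list of (camera, observed_feats) pairs, where observed_feats
                     are dense feature maps pre-extracted by a frozen backbone
    render_features: hypothetical differentiable function mapping
                     (pose, camera) -> rendered feature map of the same shape
    """
    pose = pose6d.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        # Sum squared feature residuals over every view: minimizing this
        # jointly enforces geometric consistency across viewpoints.
        loss = sum(((render_features(pose, cam) - feats) ** 2).mean()
                   for cam, feats in views)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pose.detach()
```

Because the objective is summed over views, a pose that explains only one image poorly fits the others, which is what resolves single-view depth ambiguity.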
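The second contribution aggregates noisy pose candidates from an arbitrary single-view estimator via 3D non-maximum suppression. A minimal greedy sketch, assuming candidates are scored and suppression is by Euclidean distance between candidate object centers (the suppression radius and scoring are assumptions, not values from the paper):

```python
import numpy as np

def nms_3d(translations, scores, radius=0.05):
    """Greedy 3D non-maximum suppression over pose candidates.

    translations: (N, 3) candidate object centers in a common world frame
    scores:       (N,) confidence of each candidate
    radius:       suppression radius in metres (assumed hyperparameter)
    Returns indices of kept candidates, highest-scoring first.
    """
    translations = np.asarray(translations, dtype=float)
    order = np.argsort(-np.asarray(scores, dtype=float))
    kept = []
    for i in order:
        # Keep a candidate only if no higher-scoring kept candidate is nearby.
        if all(np.linalg.norm(translations[i] - translations[j]) > radius
               for j in kept):
            kept.append(int(i))
    return kept
```

The surviving candidates can then serve as initializations for the multi-view refinement, making the pipeline robust to outlier single-view estimates.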
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 23974