Part-level Reconstruction for Self-Supervised Category-level 6D Object Pose Estimation with Coarse-to-Fine Correspondence Optimization
Abstract: Self-supervised category-level 6D pose estimation is a fundamental task in computer vision. However, existing methods face two challenges: 1) they suffer from the many-to-one ambiguity in the correspondences between pixels and point clouds, and 2) they struggle to reconstruct precise object models because of the significant part-level shape variations within a category. To address these issues, we propose a novel method based on a Coarse-to-Fine Correspondence Optimization (\textbf{CFCO}) module and a Part-level Shape Reconstruction (\textbf{PSR}) module. In the \textbf{CFCO} module, we employ Hungarian matching to generate one-to-one pseudo labels at both the region and pixel levels, providing explicit supervision for the corresponding similarity matrices. In the \textbf{PSR} module, we introduce a part-level discrete shape memory that captures fine-grained shape variations across objects and use it to perform precise reconstruction. We evaluate our method on the REAL275 and WILD6D datasets. Extensive experiments demonstrate that our method outperforms existing methods, achieving new state-of-the-art results.
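The following is a minimal sketch, not the authors' implementation, of how one-to-one pseudo labels could be derived from a pixel-to-point similarity matrix via Hungarian matching as the CFCO module describes; the function name, matrix shapes, and the hard 0/1 label form are illustrative assumptions.

```python
# Sketch: one-to-one pseudo labels from a similarity matrix via Hungarian matching.
# Shapes and names are assumptions for illustration only.
import numpy as np
from scipy.optimize import linear_sum_assignment


def one_to_one_pseudo_labels(similarity: np.ndarray) -> np.ndarray:
    """similarity: (N_pixels, N_points) predicted similarity matrix."""
    # Hungarian matching finds the assignment maximizing total similarity under a
    # one-to-one constraint, resolving the many-to-one pixel/point ambiguity.
    rows, cols = linear_sum_assignment(-similarity)  # negate: solver minimizes cost
    pseudo = np.zeros_like(similarity)
    pseudo[rows, cols] = 1.0  # hard one-to-one assignment used as a supervision target
    return pseudo


# Usage: the pseudo labels could supervise the predicted similarity matrix,
# e.g. with a cross-entropy-style loss (the choice of loss is an assumption here).
sim = np.random.rand(64, 64)
labels = one_to_one_pseudo_labels(sim)
```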
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: Our research improves multimedia and multimodal processing by coupling 2D image understanding with 3D point cloud reconstruction. We develop an approach that combines a Part-level Shape Reconstruction module with a Coarse-to-Fine Correspondence Optimization module to establish accurate correspondences between 2D visual features and 3D spatial geometry. This enables precise 3D model generation from 2D images, a critical step for applications that integrate cross-modal data.
Furthermore, our method promotes precise alignment across modalities, contributing to more robust scene understanding and object recognition, both central topics in multimedia research. Our approach provides a foundation for future work in multimodal technologies, with potential impact on a wide range of domains from entertainment to autonomous systems. This research helps machines understand and interact with the world in a multimodal manner closer to human sensory processing.
Submission Number: 3392