VoCAPTER: Voting-based Pose Tracking for Category-level Articulated Object via Inter-frame Priors

Published: 20 Jul 2024, Last Modified: 05 Aug 2024. MM 2024 Poster. License: CC BY 4.0
Abstract: Articulated objects are common in our daily life. However, most existing category-level articulated pose estimation works focus on predicting 9D poses from static point cloud observations. In this paper, we address the problem of category-level online robust 9D pose tracking of articulated objects and propose VoCAPTER, a novel 3D Voting-based Category-level Articulated object Pose TrackER. VoCAPTER efficiently updates poses between adjacent frames by utilizing the partial observation from the current frame and the estimated per-part 9D poses from the previous frame. Specifically, by incorporating prior knowledge of continuous motion between frames, we first canonicalize the input point cloud, casting the pose tracking task as an inter-frame pose-increment estimation problem. Subsequently, to obtain a robust tracking algorithm, our main idea is to leverage SE(3)-invariant features during motion. This is achieved through a voting-based articulation tracking algorithm that identifies keyframes as reference states for accurate pose updating throughout the entire video sequence. We evaluate VoCAPTER on a synthetic dataset and in real-world scenarios, demonstrating its generalization ability to diverse and complicated scenes. Through these experiments, we provide evidence of VoCAPTER's superiority and robustness in multi-frame pose tracking of articulated objects. We believe this work can facilitate progress in various fields, including robotics, embodied intelligence, and augmented reality. All code will be made publicly available.
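To make the inter-frame formulation in the abstract concrete, the following is a minimal, hypothetical Python sketch, not the authors' implementation: it canonicalizes the current partial observation with the previous frame's estimated per-part pose and composes an estimated pose increment, mirroring the "pose tracking as pose-increment estimation" idea. The name estimate_increment stands in for the paper's voting-based module, and scale (the extra dimensions of the 9D pose) is omitted for brevity.

# Hypothetical sketch of one inter-frame tracking step; not the authors' API.
import numpy as np

def canonicalize(points: np.ndarray, prev_pose: np.ndarray) -> np.ndarray:
    """Map observed part points into the previous frame's canonical frame
    by applying the inverse of the previous per-part pose (a 4x4 transform)."""
    R, t = prev_pose[:3, :3], prev_pose[:3, 3]
    return (points - t) @ R  # row-vector form of R^T (p - t)

def estimate_increment(canon_points: np.ndarray) -> np.ndarray:
    """Placeholder for the voting-based increment estimator; the paper would
    regress a small SE(3) delta from SE(3)-invariant features here. We return
    the identity so this sketch runs end to end."""
    return np.eye(4)

def track_step(points: np.ndarray, prev_pose: np.ndarray) -> np.ndarray:
    """One tracking step: pose_t = delta_t @ pose_{t-1}."""
    canon = canonicalize(points, prev_pose)
    delta = estimate_increment(canon)
    return delta @ prev_pose

# Toy usage: identity previous pose, random 128-point partial observation.
prev_pose = np.eye(4)
obs = np.random.rand(128, 3)
curr_pose = track_step(obs, prev_pose)
print(curr_pose.shape)  # (4, 4)

In this framing, only the (typically small) increment between adjacent frames must be estimated, which is what makes keyframe-based reference states useful for keeping drift bounded over long sequences.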
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Experience] Interactions and Quality of Experience
Relevance To Conference: The paper addresses the significant challenge of category-level online robust 9D pose tracking for articulated objects, presenting VoCAPTER, a novel 3D Voting-based Category-level Articulated Object Pose TrackER. Articulated objects are pervasive in multimedia applications, and their accurate pose estimation is crucial for tasks such as robotics, augmented reality, and embodied intelligence. Existing approaches primarily focus on predicting 9D poses from static point cloud observations, overlooking the dynamic nature of articulated objects. VoCAPTER stands out by efficiently updating poses between consecutive frames, leveraging both current-frame observations and the estimated per-part 9D poses from previous frames. This methodology aligns well with multimedia applications where real-time pose tracking of dynamic scenes is essential.

The paper's relevance to the ACM Multimedia conference lies in its contribution to advancing the state of the art in multi-frame pose tracking of articulated objects, a fundamental problem in multimedia analysis and understanding. The evaluation of VoCAPTER across synthetic and real-world datasets demonstrates its generalization ability and robustness, further underlining its significance for multimedia applications.

Overall, the paper's findings hold promise for enhancing various multimedia-related fields, including robotics, augmented reality, and embodied intelligence. Moreover, the commitment to publicly release datasets and code fosters transparency and reproducibility, benefiting the broader multimedia research community.
Supplementary Material: zip
Submission Number: 281