Articulated Object Understanding from a Single Video Sequence

Published: 09 Jul 2025 · Last Modified: 06 Aug 2025 · OpenReview Archive Direct Upload · CC BY 4.0
Abstract: This paper introduces a novel method for estimating the structure and joint parameters of articulated objects from a single casual video, captured by a potentially moving camera. Unlike previous works that rely on multiple static views or a priori knowledge of the object category, our approach leverages 2D point tracking and depth map prediction to generate 3D trajectories of points on the object. By analyzing these trajectories, we generate and evaluate hypotheses about joint parameters, selecting the best combination using the Bayesian Information Criterion (BIC) to avoid overfitting. We then optimize a dense 3D model of the object using Gaussian Splatting, guided by the selected joint hypotheses. Our method accurately recovers the geometry, the segmentation into parts, the joint parameters, and the motion of each part, enabling rendering of the object from new viewpoints and under new articulation states. Extensive evaluations on several datasets demonstrate the effectiveness of our approach.
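The BIC-based model selection described in the abstract can be illustrated with a minimal sketch: each candidate joint model is scored by its fit to the 3D point trajectories, penalized by its parameter count. The residual values, the noise level `sigma`, and the parameter counts below are hypothetical placeholders for illustration, not values from the paper.

```python
import math

def bic(sq_residuals, n_params, sigma=0.01):
    """BIC = k * ln(n) - 2 * ln(L), with a Gaussian likelihood
    over squared 3D trajectory residuals (hypothetical noise model)."""
    n = len(sq_residuals)
    log_lik = sum(-0.5 * r / sigma**2 - 0.5 * math.log(2 * math.pi * sigma**2)
                  for r in sq_residuals)
    return n_params * math.log(n) - 2.0 * log_lik

# Hypothetical squared residuals (m^2) from fitting each joint
# hypothesis to the same 50 tracked 3D points.
hypotheses = {
    "rigid":     ([0.0004] * 50, 6),  # 6-DoF rigid motion, poor fit
    "revolute":  ([0.0001] * 50, 8),  # hinge (axis + angles), best fit
    "prismatic": ([0.0003] * 50, 7),  # slider (direction + offsets)
}
best = min(hypotheses, key=lambda name: bic(*hypotheses[name]))
print(best)  # the hypothesis with the lowest BIC
```

Under this toy setup the revolute model wins: its lower residuals outweigh its extra parameters, which is exactly the trade-off the BIC penalty is meant to arbitrate.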