Structure from Diffusion: Taming Video Diffusion Models for Camera Pose Estimation in Dynamic Videos
Keywords: Reconstruction, geometric foundation model, world model
Abstract: Our research addresses the challenge of accurately predicting camera poses for in-the-wild dynamic videos, a task essential for applications in augmented reality, robotics, and visual perception systems. Unlike structured, lab-controlled environments, in-the-wild videos contain diverse, complex scenes with significant variability in lighting, motion, and camera movement, making accurate pose estimation a persistent challenge. To tackle this, we propose a novel video diffusion model for camera pose prediction. Our model retargets a pre-trained video generation model as a pose estimator by connecting a ray prediction head to its video encoder. In doing so, it distills the strong camera-motion and scene-dynamics priors learned by the generation model, and it leverages the intrinsic temporal continuity of video features to produce smooth, accurate pose estimates. We evaluate our approach on both dynamic and static datasets, demonstrating state-of-the-art performance. Compared to existing methods, our model achieves significant improvements in both accuracy and robustness, particularly in challenging real-world scenarios. Code will be open-sourced.
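The abstract mentions connecting a ray prediction model to a video encoder. The paper does not give implementation details, but a common way to convert a predicted per-pixel ray bundle back into an explicit camera pose is a least-squares ray intersection: the camera center is the point minimizing the summed squared distance to all predicted rays. The sketch below (the function name and the Plücker-ray parameterization are my assumptions, not details from the paper) illustrates that recovery step:

```python
import numpy as np

def rays_to_camera_center(dirs, moments):
    """Least-squares intersection of Plücker rays (d_i, m_i) with m_i = p_i x d_i.

    For unit d, the point on the ray closest to the origin is p = d x m,
    and minimizing sum_i ||(I - d_i d_i^T)(c - p_i)||^2 over c gives the
    normal equations (sum_i P_i) c = sum_i P_i p_i with P_i = I - d_i d_i^T.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for d, m in zip(dirs, moments):
        s = np.linalg.norm(d)
        d = d / s
        m = m / s  # rescale the moment so the Plücker pair stays consistent
        P = np.eye(3) - np.outer(d, d)  # projector orthogonal to the ray
        A += P
        b += P @ np.cross(d, m)
    return np.linalg.solve(A, b)

# Synthetic check: rays through a known center should recover that center.
center = np.array([1.0, -2.0, 3.0])
rng = np.random.default_rng(0)
dirs = rng.normal(size=(20, 3))
moments = np.cross(center, dirs)  # m = c x d for rays through the center
recovered = rays_to_camera_center(dirs, moments)
```

In a real pipeline the ray directions and moments would come from the ray prediction head (one ray per pixel or patch), and a similar least-squares fit over the ray directions would recover the rotation and intrinsics; this sketch only shows the translation part.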
Submission Number: 95