Structure from Diffusion: Taming Video Diffusion Models for Camera Pose Estimation in Dynamic Videos
Keywords: Reconstruction, geometric foundation model, world model
Abstract: Our research addresses the challenge of accurately predicting camera poses for in-the-wild dynamic videos, a task essential for applications in augmented reality, robotics, and visual perception systems. Unlike structured, lab-controlled environments, in-the-wild videos contain diverse, complex scenes with significant variability in lighting, motion, and camera movement, making accurate pose estimation a persistent challenge. To tackle this, we propose a novel video diffusion model for camera pose prediction. Our model retargets a pre-trained video generation model as a pose estimator by connecting a ray prediction head to its video encoder. In doing so, it distills the strong camera-motion and scene-dynamics priors learned by the generation model, and it leverages the intrinsic temporal continuity of video features to produce smooth, accurate pose estimates. We evaluate our approach on both dynamic and static datasets, demonstrating state-of-the-art performance. Compared to existing methods, our model achieves significant improvements in both accuracy and robustness, particularly in challenging real-world scenarios. Code will be open-sourced.
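The abstract mentions connecting a ray prediction model to a video encoder. The paper does not give implementation details, but a common way to convert a predicted per-pixel ray bundle back into an explicit camera pose is a least-squares ray intersection: the camera center is the point minimizing the summed squared distance to all predicted rays. The sketch below (the function name and the Plücker-ray parameterization are my assumptions, not details from the paper) illustrates that recovery step:

```python
import numpy as np

def rays_to_camera_center(dirs, moments):
    """Least-squares intersection of Plücker rays (d_i, m_i) with m_i = p_i x d_i.

    For unit d, the point on the ray closest to the origin is p = d x m,
    and minimizing sum_i ||(I - d_i d_i^T)(c - p_i)||^2 over c gives the
    normal equations (sum_i P_i) c = sum_i P_i p_i with P_i = I - d_i d_i^T.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for d, m in zip(dirs, moments):
        s = np.linalg.norm(d)
        d = d / s
        m = m / s  # rescale the moment so the Plücker pair stays consistent
        P = np.eye(3) - np.outer(d, d)  # projector orthogonal to the ray
        A += P
        b += P @ np.cross(d, m)
    return np.linalg.solve(A, b)

# Synthetic check: rays through a known center should recover that center.
center = np.array([1.0, -2.0, 3.0])
rng = np.random.default_rng(0)
dirs = rng.normal(size=(20, 3))
moments = np.cross(center, dirs)  # m = c x d for rays through the center
recovered = rays_to_camera_center(dirs, moments)
```

In a real pipeline the ray directions and moments would come from the ray prediction head (one ray per pixel or patch), and a similar least-squares fit over the ray directions would recover the rotation and intrinsics; this sketch only shows the translation part.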
Submission Number: 95