Track3R: Joint Point Map and Trajectory Prior for Spatiotemporal 3D Understanding

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: computer vision
Abstract: Understanding the 3D world from 2D monocular videos is a crucial ability for AI. Recently, end-to-end 3D geometry priors, such as point map models pre-trained at scale, have been pursued to tackle this underdetermined task. These models enable robust 3D understanding from casually captured videos, providing accurate object shapes disentangled from uncertain camera parameters. However, they still struggle with object deformation and dynamics, failing to establish consistent correspondences across frames. Furthermore, their architectures are typically limited to pairwise frame processing, which is insufficient for capturing complex motion dynamics over extended sequences. To address these limitations, we introduce Track3R, a novel framework that integrates a new architecture and task to jointly predict point maps and motion trajectories across multiple frames from video input. Specifically, our key idea is to model two disentangled trajectories for each point: one representing object motion and the other camera poses. This design not only enables understanding of 3D object dynamics, but also facilitates learning more robust priors for 3D shapes in dynamic scenes. In our experiments, Track3R demonstrates significant improvements on a joint point-mapping and 3D motion estimation task for dynamic scenes, achieving a 25.8% improvement in motion estimation and 15.7% in point-mapping accuracy.
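The abstract's key idea (a per-point trajectory factored into an object-motion transform and a camera-pose transform) can be illustrated with a minimal sketch. This is not the authors' code; the function names, the use of 4x4 rigid transforms, and the world-to-camera convention are illustrative assumptions about how such a disentangled composition might look.

```python
import numpy as np

def to_homogeneous(pts):
    """Append a 1 to each 3D point: (N, 3) -> (N, 4)."""
    return np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)

def apply_se3(T, pts):
    """Apply a 4x4 rigid transform T to (N, 3) points."""
    return (to_homogeneous(pts) @ T.T)[:, :3]

def predict_point_track(X0, object_motion, camera_poses):
    """Roll a point set forward through T frames via two disentangled
    trajectories (a conceptual sketch of the factorization, not Track3R's
    actual architecture).

    X0            : (N, 3) points in world coordinates at frame 0
    object_motion : list of T 4x4 world-space object-motion transforms
    camera_poses  : list of T 4x4 world-to-camera pose transforms
    Returns (T, N, 3) camera-frame point trajectories.
    """
    tracks = []
    for M_t, P_t in zip(object_motion, camera_poses):
        X_world = apply_se3(M_t, X0)     # object dynamics only
        X_cam = apply_se3(P_t, X_world)  # then the camera viewpoint
        tracks.append(X_cam)
    return np.stack(tracks)
```

Keeping the two transforms separate means a static scene reduces to identity object motion with only the camera trajectory varying, which is one way a model could learn shape priors that are robust to scene dynamics.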
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 25888