End-to-End Unified Dense 3D Geometry and Motion Perception

12 Sept 2025 (modified: 20 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: 3D Reconstruction, Motion Estimation
Abstract: Predicting 3D geometry and motion from videos is crucial for various applications. Most existing methods adopt a two-stage reconstruct-then-track pipeline, which first perceives 3D geometry and then exploits this 3D information to track each pixel. They typically rely on conventional iterative tracking strategies and are thus inefficient, especially for dense motion estimation. Moreover, they fail to leverage complementary motion information for better dynamic reconstruction. To address these limitations, we propose MotionVGGT, an end-to-end unified transformer architecture that simultaneously perceives dense 3D geometry, camera pose, and motion. We introduce a set of geometry, camera, and motion tokens to represent each frame; these tokens interact through interleaved frame-attention and global-attention layers. We then employ multiple heads to decode point maps, camera poses, and 3D motions from the corresponding tokens. Specifically, we design a conditional dense prediction head that uses the motion tokens as conditions to modulate the decoding of the geometry tokens, transforming them into motions. Our model directly generates dense per-pixel 3D motion fields in a single forward pass without external trackers. By unifying geometry and motion modeling, MotionVGGT further equips visual geometry foundation models with motion awareness. MotionVGGT shows strong generalization across diverse visual geometry perception tasks, establishing a practical and universal paradigm for more comprehensive scene understanding.
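The alternating frame/global attention over per-frame tokens and the motion-conditioned decoding described above can be illustrated with a minimal sketch. The module names, token counts, FiLM-style modulation, and mean-pooling of the motion condition below are assumptions made for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of interleaved frame/global attention and a
# motion-conditioned prediction head. All design details are assumptions.
import torch
import torch.nn as nn


class AlternatingAttentionBlock(nn.Module):
    """One frame-attention layer followed by one global-attention layer."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, F, N, D) -- batch, frames, tokens per frame, channels
        B, F, N, D = tokens.shape

        # Frame attention: tokens within each frame attend to each other.
        x = tokens.reshape(B * F, N, D)
        h = self.norm1(x)
        x = x + self.frame_attn(h, h, h)[0]

        # Global attention: all tokens across all frames attend to each other.
        x = x.reshape(B, F * N, D)
        h = self.norm2(x)
        x = x + self.global_attn(h, h, h)[0]
        return x.reshape(B, F, N, D)


class ConditionalMotionHead(nn.Module):
    """Decodes motion by modulating geometry tokens with motion tokens (FiLM-style, assumed)."""

    def __init__(self, dim: int, out_channels: int = 3):
        super().__init__()
        self.to_scale_shift = nn.Linear(dim, 2 * dim)  # condition -> (scale, shift)
        self.decoder = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, out_channels))

    def forward(self, geo_tokens: torch.Tensor, motion_tokens: torch.Tensor) -> torch.Tensor:
        # geo_tokens: (B, F, N_geo, D); motion_tokens: (B, F, N_mot, D)
        cond = motion_tokens.mean(dim=2, keepdim=True)        # pool motion condition
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        modulated = geo_tokens * (1 + scale) + shift          # modulate geometry tokens
        return self.decoder(modulated)                        # per-token 3D motion


if __name__ == "__main__":
    B, F, N_geo, N_mot, D = 1, 4, 196, 4, 256
    tokens = torch.randn(B, F, N_geo + N_mot + 1, D)  # geometry + motion + camera tokens
    tokens = AlternatingAttentionBlock(D)(tokens)
    geo = tokens[:, :, :N_geo]
    mot = tokens[:, :, N_geo:N_geo + N_mot]
    motion = ConditionalMotionHead(D)(geo, mot)       # (B, F, N_geo, 3) motion field
    print(motion.shape)
```

In this sketch the motion tokens act purely as a conditioning signal: the geometry tokens carry the dense per-pixel representation, so the dense 3D motion field is produced in one forward pass without any external tracker, matching the single-pass claim in the abstract.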
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4595