Video Diffusion Model for Point Tracking

13 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Video Diffusion Models, Point Tracking, Visual Correspondence
Abstract: Point tracking aims to estimate pixel trajectories across video frames but remains challenging under large displacements, occlusion, and real-world artifacts. Conventional trackers, built on image-centric backbones and trained on synthetic data, often fail in these settings. We revisit this problem through the lens of video diffusion models based on Diffusion Transformers (DiTs), whose 3D global attention structure and large-scale training naturally provide global temporal context and real-world priors. We first analyze the intrinsic robustness of video DiT features, showing stronger correlation maps than supervised ResNet backbones even under occlusion and motion blur. To fully exploit these properties, we introduce an upsampler that restores spatial detail while fusing multi-layer features, followed by an iterative refiner for high-precision trajectories. Extensive experiments on TAP-Vid benchmarks demonstrate that our framework achieves superior robustness and accuracy compared to existing backbones, establishing video DiTs as powerful foundations for point tracking.
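The abstract only sketches the pipeline at a high level (DiT features → correlation maps → multi-layer upsampling → iterative refinement). The snippet below is a minimal PyTorch-style illustration of that flow; every module name, tensor shape, and hyperparameter is an assumption for exposition and does not reflect the authors' actual implementation.

```python
# Illustrative sketch of the described pipeline: correlate video-DiT features
# with a query point, fuse/upsample multi-layer features, then refine tracks.
# All names and shapes are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureUpsampler(nn.Module):
    """Fuses features from several DiT layers and restores spatial detail (assumed design)."""

    def __init__(self, in_dims, out_dim=256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(d, out_dim, 1) for d in in_dims])
        self.mix = nn.Conv2d(out_dim, out_dim, 3, padding=1)

    def forward(self, feats, out_hw):
        # feats: list of (T, C_l, h_l, w_l) feature maps from different DiT layers
        fused = 0
        for f, p in zip(feats, self.proj):
            fused = fused + F.interpolate(p(f), size=out_hw, mode="bilinear", align_corners=False)
        return self.mix(fused)


def correlation_map(feat, query_vec):
    # feat: (T, C, H, W) per-frame features; query_vec: (C,) feature at the query point
    feat = F.normalize(feat, dim=1)
    q = F.normalize(query_vec, dim=0).view(1, -1, 1, 1)
    return (feat * q).sum(dim=1)  # (T, H, W) cosine-similarity maps


def track_point(feat, query_vec, refiner=None, iters=4):
    # Coarse estimate: per-frame argmax of the correlation map.
    corr = correlation_map(feat, query_vec)                    # (T, H, W)
    T, H, W = corr.shape
    idx = corr.flatten(1).argmax(dim=1)
    traj = torch.stack([idx % W, idx // W], dim=-1).float()    # (T, 2) as (x, y)
    # Placeholder for the paper's iterative refiner, which would predict
    # residual updates to the trajectory from local feature context.
    for _ in range(iters if refiner is not None else 0):
        traj = traj + refiner(feat, traj)
    return traj
```

In this reading, the upsampler compensates for the low spatial resolution of DiT tokens before correlation, and the refiner replaces the naive argmax with learned, high-precision trajectory updates; both choices are inferred from the abstract rather than stated specifics.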
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4685