Point Prompting: Counterfactual Tracking with Video Diffusion Models

Published: 26 Jan 2026, Last Modified: 11 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: video diffusion models, tracking, tracking any point, diffusion, correspondence, matching, video generation
TL;DR: We propose a simple and effective zero-shot tracking approach: by placing a colored marker in the first frame, we guide the model to propagate the marker across frames, following the underlying video’s motion.
Abstract: Recent advances in video generation have produced powerful diffusion models capable of generating high-quality, temporally coherent videos. We ask whether space-time tracking capabilities emerge automatically within these generators, as a consequence of the close connection between synthesizing and estimating motion. We propose a simple but effective way to elicit point tracking capabilities in off-the-shelf image-conditioned video diffusion models. We simply place a colored marker in the first frame, then guide the model to propagate the marker across frames, following the underlying video’s motion. To ensure the marker remains visible despite the model’s natural priors, we use the unedited video’s initial frame as a negative prompt. We evaluate our method on the TAP-Vid benchmark using several video diffusion models. We find that it outperforms prior zero-shot methods, often obtaining performance that is competitive with specialized self-supervised models, despite requiring no additional training.
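For concreteness, a minimal sketch of the two ingredients the abstract describes: painting a marker onto the first frame, and combining noise predictions so that the unedited first frame acts as a negative prompt. The helper names (`add_marker`, `guided_eps`) and the `denoise_fn` wrapper around an image-conditioned video diffusion model are illustrative assumptions, not the authors' released code.

```python
from PIL import Image, ImageDraw


def add_marker(first_frame: Image.Image, point_xy, radius: int = 6, color=(255, 0, 0)) -> Image.Image:
    """Return a copy of the first frame with a colored disk painted at the query point."""
    frame = first_frame.copy()
    draw = ImageDraw.Draw(frame)
    x, y = point_xy
    draw.ellipse([x - radius, y - radius, x + radius, y + radius], fill=color)
    return frame


def guided_eps(denoise_fn, x_t, t, marked_frame, clean_frame, guidance_scale: float = 2.0):
    """Classifier-free-guidance-style combination of noise predictions:
    condition on the marked first frame and treat the unedited first frame as the
    negative prompt, so the marker is propagated rather than erased.

    `denoise_fn(x_t, t, cond_frame)` is a hypothetical wrapper around the noise
    prediction of an image-conditioned video diffusion model.
    """
    eps_pos = denoise_fn(x_t, t, marked_frame)  # conditioned on the frame with the marker
    eps_neg = denoise_fn(x_t, t, clean_frame)   # conditioned on the unedited frame
    return eps_neg + guidance_scale * (eps_pos - eps_neg)
```

Under these assumptions, each sampling step steers generation toward the marked conditioning and away from the unedited frame; the marker's location in each generated frame can then be read off as the point's track.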
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4825