Keywords: Diffusion, Attention, Training-free, VOS, Inversion
TL;DR: We introduce \drift{}, which leverages diffusion self-attention with test-time optimizations for cross-frame label propagation, achieving state-of-the-art zero-shot performance on four VOS benchmarks.
Abstract: Diffusion models, though developed for image generation, implicitly capture rich semantic structure. We observe that their self-attention maps can be reinterpreted as label propagation kernels that provide robust pixel-level correspondences between semantically related image regions. Extending this mechanism across frames yields a temporal propagation kernel that enables zero-shot, training-free object tracking via segmentation. We further enhance this process with test-time optimizations: DDIM inversion for semantically aligned representations, textual inversion for object-specific cues, and adaptive head weighting to combine complementary attention patterns. Our framework, \drift{}, achieves state-of-the-art zero-shot performance on four standard VOS benchmarks, rivaling supervised approaches and highlighting the strong semantic structure captured by diffusion self-attention.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 5314
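The cross-frame propagation described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the paper's code: the function name `propagate_labels`, the tensor shapes, and the use of raw per-pixel query/key features are illustrative assumptions. The idea it demonstrates is that softmax-normalized affinities between next-frame queries and previous-frame keys act as a kernel that transports the previous frame's soft segmentation mask.

```python
import torch
import torch.nn.functional as F


def propagate_labels(q_next, k_prev, labels_prev, scale=None):
    """Propagate a soft mask from one frame to the next via an
    attention-style affinity kernel (hypothetical helper, not the
    authors' implementation).

    q_next:      (N, d) per-pixel query features of the next frame
    k_prev:      (M, d) per-pixel key features of the previous frame
    labels_prev: (M, C) soft one-hot mask of the previous frame
    Returns:     (N, C) soft mask for the next frame
    """
    d = q_next.shape[-1]
    scale = d ** 0.5 if scale is None else scale
    # Softmax affinities between next-frame queries and previous-frame
    # keys serve as the cross-frame propagation kernel.
    affinity = F.softmax(q_next @ k_prev.T / scale, dim=-1)  # (N, M)
    return affinity @ labels_prev  # (N, C)


# Toy usage with random features standing in for diffusion self-attention
# queries/keys (a 16x16 frame -> 256 pixels, 64-dim features, 2 classes).
q = torch.randn(256, 64)
k = torch.randn(256, 64)
mask = F.one_hot(torch.randint(0, 2, (256,)), num_classes=2).float()
print(propagate_labels(q, k, mask).shape)  # torch.Size([256, 2])
```

In \drift{} itself, per the abstract, the features would come from a diffusion model's self-attention layers (obtained via DDIM inversion rather than random tensors), and the per-head kernels would be combined with adaptive head weights before propagation.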