Keywords: Diffusion, Attention, Training-free, VOS, Inversion
TL;DR: We introduce \drift{}, which leverages diffusion self-attention with test-time optimizations for cross-frame label propagation, achieving state-of-the-art zero-shot performance on four VOS benchmarks.
Abstract: Diffusion models, though developed for image generation, implicitly capture rich semantic structure. We observe that their self-attention maps can be reinterpreted as label propagation kernels that provide robust pixel-level correspondences between semantically related image regions. Extending this mechanism across frames yields a temporal propagation kernel that enables zero-shot, training-free object tracking via segmentation. We further enhance this process with test-time optimizations: DDIM inversion for semantically aligned representations, textual inversion for object-specific cues, and adaptive head weighting to combine complementary attention patterns. Our framework, \drift{}, achieves state-of-the-art zero-shot performance on four standard VOS benchmarks, rivaling supervised approaches and highlighting the strong semantic structure captured by diffusion self-attention.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 5314
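The cross-frame propagation described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the paper's code: the function name `propagate_labels`, the tensor shapes, and the use of raw per-pixel query/key features are illustrative assumptions. The idea it demonstrates is that softmax-normalized affinities between next-frame queries and previous-frame keys act as a kernel that transports the previous frame's soft segmentation mask.

```python
import torch
import torch.nn.functional as F


def propagate_labels(q_next, k_prev, labels_prev, scale=None):
    """Propagate a soft mask from one frame to the next via an
    attention-style affinity kernel (hypothetical helper, not the
    authors' implementation).

    q_next:      (N, d) per-pixel query features of the next frame
    k_prev:      (M, d) per-pixel key features of the previous frame
    labels_prev: (M, C) soft one-hot mask of the previous frame
    Returns:     (N, C) soft mask for the next frame
    """
    d = q_next.shape[-1]
    scale = d ** 0.5 if scale is None else scale
    # Softmax affinities between next-frame queries and previous-frame
    # keys serve as the cross-frame propagation kernel.
    affinity = F.softmax(q_next @ k_prev.T / scale, dim=-1)  # (N, M)
    return affinity @ labels_prev  # (N, C)


# Toy usage with random features standing in for diffusion self-attention
# queries/keys (a 16x16 frame -> 256 pixels, 64-dim features, 2 classes).
q = torch.randn(256, 64)
k = torch.randn(256, 64)
mask = F.one_hot(torch.randint(0, 2, (256,)), num_classes=2).float()
print(propagate_labels(q, k, mask).shape)  # torch.Size([256, 2])
```

In \drift{} itself, per the abstract, the features would come from a diffusion model's self-attention layers (obtained via DDIM inversion rather than random tensors), and the per-head kernels would be combined with adaptive head weights before propagation.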