Autoregressive Temporal Modeling for Advanced Tracking-by-Diffusion

Published: 2025 · Last Modified: 14 Jan 2026 · Int. J. Comput. Vis. 2025 · CC BY-SA 4.0
Abstract: Object tracking is a widely studied computer vision task with applications in video and instance analysis. While paradigms such as tracking-by-regression, tracking-by-detection, and tracking-by-attention have advanced the field, generative modeling offers new potential. Although some studies explore the generative process in instance-based understanding tasks, they rely on prediction refinement in the coordinate space rather than the visual domain. In contrast, this paper presents Tracking-by-Diffusion, a novel paradigm for object tracking in video that leverages visual generative models from the perspective of autoregressive modeling. This paradigm demonstrates broad applicability across point, box, and mask modalities while uniquely enabling textual guidance. We present DIFTracker, a framework that utilizes iterative latent-variable diffusion models to redefine tracking as a next-frame reconstruction task. Our approach jointly models spatial and temporal dependencies in video data, offering a unified solution that encompasses existing tracking paradigms within a single Inversion-Reconstruction process. DIFTracker operates online and autoregressively, enabling flexible instance-based video understanding. This design overcomes the difficulties that video-inflated models encounter with variable-length videos and achieves superior performance on seven benchmarks across five modalities. This paper not only introduces a new perspective on visual autoregressive modeling for sequential visual data, specifically video, but also provides theoretical validation and demonstrates broader applications in visual tracking and computer vision.
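To make the online, autoregressive Inversion-Reconstruction loop described above more concrete, the following Python sketch outlines one plausible reading of it: each incoming frame is encoded to a latent, inverted against the previous target representation, reconstructed under that condition, and decoded into a track output that then conditions the next step. All names here (`encode`, `invert`, `reconstruct`, `decode_track`) are hypothetical placeholders rather than DIFTracker's actual components, and the toy arithmetic merely marks where a latent diffusion model's VAE encoder, DDIM-style inversion, and conditional denoising would sit.

```python
# Minimal sketch of an autoregressive Inversion-Reconstruction tracking loop.
# Assumption: all functions below are illustrative stand-ins, not the paper's API.
import numpy as np


def encode(frame: np.ndarray) -> np.ndarray:
    """Stand-in for a VAE encoder mapping an RGB frame to a latent."""
    return frame.mean(axis=-1, keepdims=True)  # toy latent


def invert(latent: np.ndarray, condition: np.ndarray) -> np.ndarray:
    """Stand-in for inversion of the current latent under the previous target condition."""
    return latent + 0.1 * condition  # toy forward (noising) trajectory


def reconstruct(noisy_latent: np.ndarray, condition: np.ndarray) -> np.ndarray:
    """Stand-in for the conditional denoising (reconstruction) pass."""
    return noisy_latent - 0.1 * condition  # toy reverse (denoising) trajectory


def decode_track(latent: np.ndarray) -> tuple:
    """Stand-in for reading out a box (or point/mask) from the reconstructed latent."""
    ys, xs = np.nonzero(latent[..., 0] > latent.mean())
    if len(xs) == 0:
        return (0, 0, 0, 0)
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))


def track_video(frames, init_condition):
    """Online loop: invert each frame's latent, reconstruct it, and feed the
    reconstruction back as the condition for the next frame (autoregression)."""
    condition = init_condition
    results = []
    for frame in frames:
        latent = encode(frame)
        noisy = invert(latent, condition)       # Inversion step
        recon = reconstruct(noisy, condition)   # Reconstruction step
        results.append(decode_track(recon))
        condition = recon                       # autoregressive update
    return results


if __name__ == "__main__":
    video = [np.random.rand(64, 64, 3) for _ in range(5)]
    print(track_video(video, init_condition=encode(video[0])))
```

Because the condition for each step is the previous reconstruction, the loop handles videos of arbitrary length frame by frame, which is the property the abstract contrasts with fixed-length video-inflated models.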