Abstract: Online contextual reasoning and association across consecutive
video frames are critical for perceiving instances in visual
tracking. However, most current top-performing trackers
rely on sparse temporal relationships between
reference and search frames in an offline mode. Consequently,
they interact only within each independent
image pair and establish limited temporal correlations. To
alleviate this problem, we propose a simple, flexible,
and effective video-level tracking pipeline, named ODTrack,
which densely associates the contextual relationships of video
frames in an online token propagation manner. ODTrack accepts
video frames of arbitrary length to capture the spatiotemporal
trajectory of an instance, and compresses
the discriminative features (localization information)
of the target into a token sequence to achieve frame-to-frame
association. This new solution brings the following benefits:
1) the purified token sequences can serve as prompts
for inference on the next video frame, whereby past information
is leveraged to guide future inference; 2) complex
online update strategies are effectively avoided through the
iterative propagation of token sequences, allowing a more
efficient model representation and computation.
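The frame-to-frame association described above can be sketched as a simple propagation loop; this is a minimal illustration under assumed placeholders, not the authors' implementation — `extract_features` and `compress_to_token` are hypothetical stand-ins for the backbone and the token-compression step.

```python
import numpy as np

def extract_features(frame: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for a feature backbone:
    # mean-pool spatial dimensions into a channel vector.
    return frame.mean(axis=(0, 1))

def compress_to_token(features: np.ndarray, prev_token: np.ndarray) -> np.ndarray:
    # Hypothetical token update: blend current-frame features with the
    # propagated token, so past localization cues guide the present frame.
    return 0.5 * prev_token + 0.5 * features

def track(frames):
    """Online token propagation over a video of arbitrary length.

    Each frame's token is computed from the previous token (past guiding
    future), so no separate online model-update strategy is needed.
    """
    token = np.zeros(frames[0].shape[-1])  # initial (empty) target token
    tokens = []
    for frame in frames:
        feats = extract_features(frame)
        token = compress_to_token(feats, token)  # iterative propagation
        tokens.append(token.copy())
    return tokens
```

The key design point this sketch mirrors is that the only state carried between frames is the compact token itself, rather than stored templates or an updated model.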
ODTrack achieves new state-of-the-art performance on seven benchmarks
while running at real-time speed. Code and models are available.