Unifying Motion and Appearance Cues for Visual Tracking via Shared Queries

Published: 01 Jan 2025 · Last Modified: 12 Apr 2025 · IEEE Trans. Circuits Syst. Video Technol. 2025 · CC BY-SA 4.0
Abstract: The rich motion and appearance cues between consecutive frames are crucial for robust visual tracking. However, most existing tracking methods rely on separate components to exploit each cue, and some ignore one of the cues entirely. This makes it difficult to maintain effective interaction between the different cues, hindering the models from forming a comprehensive understanding of the target objects. To address these issues, we propose a unified spatio-temporal cues learning framework (named USCLTrack) that comprehensively mines the variation patterns of targets across consecutive frames in complex video streams. Specifically, USCLTrack first aggregates motion and appearance cues into shared queries, which serve as a bridge for interaction between the two cues. It then generates object locations autoregressively, conditioned on these shared queries, so that both cues jointly guide future inferences. To effectively learn the multiple spatio-temporal cues aggregated in the shared queries, we develop a spatio-temporal attention mechanism. This mechanism integrates motion cues with appearance cues according to the time steps to ensure temporal consistency, and concurrently captures motion trends and appearance changes to facilitate understanding of the target objects. Extensive experiments on eight popular tracking benchmarks validate the effectiveness of the proposed USCLTrack.
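The abstract does not give implementation details, but the core idea of shared queries jointly attending over motion and appearance tokens, updated step by step over time, can be illustrated with a toy NumPy sketch. All names, shapes, and the single-head dot-product attention below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatio_temporal_update(queries, motion, appearance):
    """One simplified attention step: shared queries attend jointly
    over motion and appearance tokens of a frame, so both cues update
    the same query representations (hypothetical formulation)."""
    d = queries.shape[-1]
    keys = np.concatenate([motion, appearance], axis=0)   # (Nm + Na, d)
    scores = queries @ keys.T / np.sqrt(d)                # (Q, Nm + Na)
    attn = softmax(scores, axis=-1)
    return queries + attn @ keys                          # residual update

# Roll the update across time steps so that earlier frames condition the
# queries used at later frames, mimicking autoregressive use of shared queries.
rng = np.random.default_rng(0)
d, num_q, T = 16, 4, 3
queries = rng.standard_normal((num_q, d))
for t in range(T):
    motion_t = rng.standard_normal((5, d))       # stand-in motion tokens, frame t
    appearance_t = rng.standard_normal((5, d))   # stand-in appearance tokens, frame t
    queries = spatio_temporal_update(queries, motion_t, appearance_t)
```

Because the same query tensor carries information forward across frames, the two cues interact through a single shared state rather than through separate motion and appearance branches, which is the interaction the abstract emphasizes.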