Towards Universal Modal Tracking With Online Dense Temporal Token Learning

Published: 01 Jan 2025, Last Modified: 05 Nov 2025 · IEEE Trans. Pattern Anal. Mach. Intell. 2025 · CC BY-SA 4.0
Abstract: We propose a universal video-level modality-aware tracking model with online dense temporal token learning (called UM-ODTrack). It is designed to support various tracking tasks, including RGB, RGB+Thermal, RGB+Depth, and RGB+Event, using the same model architecture and parameters. Specifically, our model is designed with three core goals. Video-level sampling: we expand the model's input to the video-sequence level, so that it sees a richer video context from a near-global perspective. Video-level association: we introduce two simple yet effective online dense temporal token association mechanisms that propagate the appearance and motion trajectory information of the target in a video-stream manner. Modality scalability: we propose two novel gated perceivers that adaptively learn cross-modal representations via a gated attention mechanism and compress them into the same set of model parameters through one-shot training for multi-task inference. This solution brings the following benefits: (i) the purified token sequences serve as temporal prompts for inference on subsequent video frames, so that previous information is leveraged to guide future inference; (ii) unlike multi-modal trackers that require independent training per modality, our one-shot training scheme not only reduces the training burden but also improves the model representation. Extensive experiments on visible and multi-modal benchmarks show that UM-ODTrack achieves new state-of-the-art performance.
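To make the video-level association idea concrete, below is a minimal PyTorch sketch of online temporal token propagation as the abstract describes it: a learnable token attends to each sampled frame's features, and the updated token is carried forward as a temporal prompt for the next frame. This is not the authors' released code; the class, method, and tensor-shape choices here are illustrative assumptions.

```python
# Sketch (assumed, not UM-ODTrack's implementation) of online dense
# temporal token propagation across a sampled video sequence.
import torch
import torch.nn as nn

class TemporalTokenPropagator(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.token = nn.Parameter(torch.zeros(1, 1, dim))  # initial temporal token
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats: list) -> list:
        """frame_feats: list of (B, N, dim) patch features, one per sampled frame."""
        B = frame_feats[0].shape[0]
        token = self.token.expand(B, -1, -1)  # (B, 1, dim)
        prompts = []
        for feats in frame_feats:
            # The token queries the current frame; the result accumulates
            # appearance and trajectory cues and is carried to the next frame,
            # so previous information guides future inference.
            upd, _ = self.attn(query=token, key=feats, value=feats)
            token = self.norm(token + upd)
            prompts.append(token)
        return prompts  # per-frame temporal prompts
```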
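Similarly, one plausible reading of the gated perceiver is cross-attention from RGB tokens to an auxiliary modality (thermal, depth, or event), with a learned sigmoid gate deciding how much cross-modal evidence each token admits, so a single set of parameters can serve both RGB-only and multi-modal inputs. Again a hedged sketch under assumed names and shapes, not the paper's actual module.

```python
# Sketch (assumed) of a gated perceiver: gated cross-modal attention
# that fuses an auxiliary modality into the RGB token stream.
import torch
import torch.nn as nn

class GatedPerceiver(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        """rgb, aux: (B, N, dim) token features from the two modalities."""
        fused, _ = self.cross_attn(query=rgb, key=aux, value=aux)
        # Per-token gate: near 0 falls back to RGB-only features,
        # near 1 admits the full cross-modal update.
        g = self.gate(torch.cat([rgb, fused], dim=-1))
        return self.norm(rgb + g * fused)
```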