Efficient Motion Prompt Learning for Robust Visual Tracking

Jie Zhao; Xin Chen; Yongsheng Yuan; Michael Felsberg; Dong Wang; Huchuan Lu

Published: 01 May 2025, Last Modified: 23 Jul 2025

Venue: ICML 2025 poster

License: CC BY-NC-SA 4.0

Abstract: Due to the challenges of processing temporal information, most trackers depend solely on visual discriminability and overlook the unique temporal coherence of video data. In this paper, we propose a lightweight and plug-and-play motion prompt tracking method. It can be easily integrated into existing vision-based trackers to build a joint tracking framework leveraging both motion and vision cues, thereby achieving robust tracking through efficient prompt learning. A motion encoder with three different positional encodings is proposed to encode the long-term motion trajectory into the visual embedding space, while a fusion decoder and an adaptive weight mechanism are designed to dynamically fuse visual and motion features. We integrate our motion module into three different trackers with five models in total. Experiments on seven challenging tracking benchmarks demonstrate that the proposed motion module significantly improves the robustness of vision-based trackers, with minimal training costs and negligible speed sacrifice. Code is available at https://github.com/zj5559/Motion-Prompt-Tracking.
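The two ideas in the abstract, encoding a trajectory into embedding tokens and adaptively weighting motion against vision, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which three positional encodings are used, so a standard sinusoidal encoding stands in for them, and the fusion decoder is reduced to a sigmoid-gated blend of two scalar scores. The function names, the `(cx, cy, w, h)` box format, the scaling factor, and the `weight_logit` parameter are all hypothetical and chosen only for illustration.

```python
import math


def sinusoidal_encoding(value, dim=8, max_period=10000.0):
    """Encode a scalar into `dim` sine/cosine features (dim must be even).

    Stand-in for the paper's positional encodings, which are not
    specified in the abstract.
    """
    half = dim // 2
    feats = []
    for i in range(half):
        freq = 1.0 / (max_period ** (i / half))
        feats.append(math.sin(value * freq))
        feats.append(math.cos(value * freq))
    return feats


def encode_trajectory(boxes, dim=8, scale=100.0):
    """Turn a history of boxes into one motion token per frame.

    `boxes` is a list of (cx, cy, w, h) tuples normalized to [0, 1]
    (an assumed format). Each token concatenates the encodings of the
    four box values and of the frame index.
    """
    tokens = []
    for t, box in enumerate(boxes):
        token = []
        for v in box:
            # Scale normalized coordinates so small shifts change the encoding.
            token.extend(sinusoidal_encoding(v * scale, dim))
        # Temporal index of this frame within the trajectory.
        token.extend(sinusoidal_encoding(float(t), dim))
        tokens.append(token)
    return tokens


def fuse(visual_score, motion_score, weight_logit):
    """Blend two scores with a sigmoid gate.

    In a learned model `weight_logit` would be predicted per frame;
    here it is passed in directly (hypothetical simplification).
    """
    w = 1.0 / (1.0 + math.exp(-weight_logit))
    return w * motion_score + (1.0 - w) * visual_score


if __name__ == "__main__":
    history = [(0.50, 0.50, 0.10, 0.10), (0.52, 0.50, 0.10, 0.10)]
    tokens = encode_trajectory(history)
    print(len(tokens), len(tokens[0]))
    print(fuse(visual_score=0.2, motion_score=0.8, weight_logit=0.0))
```

A gate of this kind lets the tracker lean on motion when appearance is unreliable (large positive logit) and fall back to vision otherwise (large negative logit), which is the behavior the abstract attributes to the adaptive weight mechanism.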

Lay Summary: Tracking specific objects in videos is still challenging because most methods focus only on how things look, ignoring how they move over time. This makes trackers less reliable when objects undergo significant appearance changes or encounter severe distractors. We propose a simple and efficient solution: a motion prompt tracking module that helps trackers better understand object movement. Our method can be easily added to existing trackers without major changes or retraining from scratch. It uses a motion encoder to extract motion patterns from objects’ historical trajectories and combines them with visual information using a fusion decoder with an adaptive weighting mechanism. We tested this motion module on multiple tracking methods across several challenging benchmarks. Results show that it consistently improves tracking robustness with minimal impact on speed.

Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.

Link To Code: https://github.com/zj5559/Motion-Prompt-Tracking

Primary Area: Applications->Computer Vision

Keywords: visual object tracking, temporal encoding, prompt learning

Submission Number: 536
