Abstract: Recent transformer techniques have achieved promising performance boosts in visual object tracking, owing to their capability to exploit long-range dependencies among relevant tokens. However, long-range interaction comes at the expense of heavy computation, which grows quadratically with the number of tokens. This becomes particularly acute in online visual tracking with a memory bank containing multiple templates, a widely used strategy for addressing spatiotemporal template variations. We address this complexity problem by proposing a memory prompt tracker (MPTrack) that enables multitemplate aggregation and efficient interactions among relevant queries and clues. The memory prompt gathers supporting context from the historical templates in the form of learnable token queries, producing a concise dynamic target representation. The extracted prompt tokens are then fed into a transformer encoder–decoder to inject the relevant clues into the instance, thus achieving improved target awareness from the spatiotemporal perspective. Experimental results on standard benchmarking datasets, i.e., UAV123, TrackingNet, the large-scale single object tracking benchmark (LaSOT), and the generic object tracking benchmark (GOT)-10k, demonstrate the merit of the proposed memory prompt in achieving efficient and competitive tracking performance compared with the state of the art.
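The abstract's core idea, compressing a memory bank of template tokens into a few learnable prompt tokens via attention, can be sketched as follows. This is a minimal illustration only, not the paper's actual implementation: the module name `MemoryPrompt`, the dimensions, and the single cross-attention layer are all assumptions; the key point it shows is that downstream cost depends on the small number of prompt tokens rather than on the full memory bank.

```python
import torch
import torch.nn as nn

class MemoryPrompt(nn.Module):
    """Hypothetical sketch: a fixed set of learnable queries cross-attends to
    tokens from multiple historical templates, distilling them into a concise
    dynamic target representation (the "memory prompt")."""

    def __init__(self, dim: int = 256, num_prompts: int = 8, num_heads: int = 8):
        super().__init__()
        # Learnable token queries that will gather supporting context.
        self.queries = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, template_tokens: torch.Tensor) -> torch.Tensor:
        # template_tokens: (B, T * N, dim) — tokens from T templates, N tokens each.
        b = template_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)  # (B, num_prompts, dim)
        out, _ = self.attn(q, template_tokens, template_tokens)
        # Output size is independent of how many templates the memory bank holds,
        # so later encoder–decoder attention stays cheap.
        return self.norm(out)  # (B, num_prompts, dim)

# Three 8x8-token templates (192 tokens) are compressed into 8 prompt tokens.
prompts = MemoryPrompt()(torch.randn(2, 3 * 64, 256))
print(prompts.shape)  # torch.Size([2, 8, 256])
```

The prompt tokens would then be concatenated with (or attended to by) the search-region tokens inside the encoder–decoder, so the quadratic attention cost scales with the prompt length instead of the full multitemplate memory.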