Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Visual object tracking, Spartiotemporal attention
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: Visual object tracking based on MixFormer applying spatio-temporal attention.
Abstract: In this paper, we present a unified spatio-temporal attention MixFormer framework for visual object tracking. Within the vision transformer framework, we design a cohesive network consisting of target template and search region feature extraction, cross-attention utilizing spatial and temporal information, and task-specific heads, all operating in an end-to-end manner. Incorporating spatial and temporal attention modules within the network enables simultaneous feature extraction and emphasis, allowing the model to concentrate on target-specific discriminative features despite changes in illumination, occlusion, scale, camera pose, and background clutter. Stacking multiple non-hierarchical blocks allows meaningful features to be extracted while irrelevant features are discarded from the provided target template and search region. The simultaneous spatio-temporal attention module is employed to accentuate target appearance features and alleviate variation in the object state across frame sequences. Qualitative and quantitative analysis, including ablation tests based on various tracking benchmarks, validates the robustness of the proposed tracking methodology.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5087
Loading