Abstract: With the assistance of language descriptions, Visual-Language (VL) object tracking can obtain more accurate semantic information than traditional visual-only object tracking. However, the ability of current VL trackers to extract target semantic information remains underdeveloped due to limitations such as wasted modeling capacity and insufficient use of historical temporal information. On the one hand, the outputs of shallow Transformer encoder layers often do not directly participate in predicting the tracking result, wasting part of the model's capacity. On the other hand, the semantic information carried by historical tracking results is also not fully exploited during tracking, weakening the model's semantic assistance. We therefore propose a novel hierarchical multi-stage VL tracker, SIEVL-Track, to enhance target semantic information. Specifically, we first design a multi-stage visual-language tracking framework that models multi-scale semantic information in the VL tracking pipeline. Secondly, we propose a selective deep and shallow semantic information fusion module (S-DSFM) that explicitly integrates shallow output features into deep output features, so as to reduce the waste of modeling capacity and capture more high-frequency semantic information related to the target. Finally, we design a temporal cue modeling module based on linguistic classification and multi-frame historical information (MHLS-TCM), aiming at a more comprehensive use of historical temporal semantic information. Benefiting from these designs, our VL tracker obtains stronger target semantic information. Competitive results from extensive experiments on five popular vision-language tracking benchmarks, LaSOT, OTB99-Lang, WebUAV-3M, LaSOText, and TNL2K, demonstrate the superiority and effectiveness of SIEVL-Track.
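The abstract does not give the S-DSFM equations, but the idea of selectively injecting shallow-encoder features into deep-encoder features can be sketched as a gated residual fusion. The sketch below is a minimal illustration under that assumption; the gating form, the projection matrices `w_gate`/`w_proj`, and the function name `selective_fusion` are all hypothetical, not the paper's actual design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def selective_fusion(shallow, deep, w_gate, w_proj):
    """Hypothetical gated fusion of shallow and deep encoder tokens.

    shallow, deep: (num_tokens, dim) feature matrices from shallow/deep layers.
    w_gate, w_proj: (dim, dim) stand-ins for learned projections.
    """
    # Per-token, per-channel gate decides which shallow cues to keep.
    gate = sigmoid(shallow @ w_gate)
    # Selected shallow information is added residually to the deep features,
    # so shallow-layer modeling capacity contributes to the final prediction.
    return deep + gate * (shallow @ w_proj)

rng = np.random.default_rng(0)
n_tokens, dim = 8, 16
shallow = rng.standard_normal((n_tokens, dim))
deep = rng.standard_normal((n_tokens, dim))
fused = selective_fusion(shallow, deep,
                         rng.standard_normal((dim, dim)),
                         rng.standard_normal((dim, dim)))
print(fused.shape)  # (8, 16): same shape as the deep features it augments
```

Because the fusion is residual, the deep features pass through unchanged wherever the gate closes, which is one common way such modules avoid degrading the deep representation.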
External IDs: dblp:journals/tcsv/LiZLMNS25