Abstract: In the field of visual-language (VL) tracking, most existing methods have achieved impressive progress in multimodal information fusion, where the attention mechanism plays a crucial role. However, traditional attention mechanisms compute correlations independently, which can introduce noise and ambiguity into the attention weights and restrict further performance gains. To address this issue, we propose a multimodal discriminative fusion module. This module adopts a cross-utilization strategy that combines multimodal discriminative attention (MDA) with multi-head attention and exploits the consistency among multimodal correlation vectors. In this way, it enhances effective cross-modal correlations and suppresses incorrect ones, thereby optimizing and strengthening multimodal interaction and fusion. In addition, we propose VLDF, a concise, flexible, and efficient VL tracking pipeline. VLDF abandons complex prediction-head designs and instead generates the target position autoregressively, which reduces model complexity and improves tracking stability. Finally, we conduct extensive experiments on the TNL2K, LaSOT, LaSOText, and OTB99-Lang benchmarks. The experimental results verify the effectiveness of the proposed method.
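The abstract gives no implementation details, so the following is a minimal PyTorch sketch of one plausible form of the discriminative step: cross-modal attention weights are gated by the agreement between the two softmax directions of the same vision-language correlation map, so correlations supported from both sides are enhanced and one-sided (likely noisy) ones are suppressed. The class name, the gating rule, and all tensor shapes are assumptions, not details confirmed by the paper.

```python
import torch
from torch import nn


class DiscriminativeCrossAttention(nn.Module):
    """Hypothetical sketch of multimodal discriminative attention (MDA).

    The gate below trusts a vision-language correlation only when it is
    strong under both softmax directions of the same logit map; this is
    an assumed realization of "enhancing effective cross-modal
    correlations and suppressing incorrect ones", not the paper's
    confirmed design.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def _split_heads(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        return x.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # vis: (B, Nv, C) visual tokens; lang: (B, Nl, C) language tokens.
        q = self._split_heads(self.q_proj(vis))    # (B, H, Nv, d)
        k = self._split_heads(self.k_proj(lang))   # (B, H, Nl, d)
        v = self._split_heads(self.v_proj(lang))   # (B, H, Nl, d)

        logits = (q @ k.transpose(-2, -1)) * self.scale  # (B, H, Nv, Nl)

        a_v2l = logits.softmax(dim=-1)   # vision attends to language
        a_l2v = logits.softmax(dim=-2)   # language attends to vision

        # Consistency gate: large only where both directions agree.
        gate = a_v2l * a_l2v
        attn = a_v2l * gate
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)

        out = (attn @ v).transpose(1, 2).reshape(vis.shape)  # (B, Nv, C)
        return self.out_proj(out)


# Usage: fuse 1024 visual tokens with 20 language tokens.
mda = DiscriminativeCrossAttention(dim=256, num_heads=8)
fused = mda(torch.randn(2, 1024, 256), torch.randn(2, 20, 256))  # (2, 1024, 256)
```

The abstract also states that VLDF generates the target position autoregressively rather than through a complex prediction head. A common realization in sequence-based trackers quantizes the box coordinates into discrete bins and decodes four tokens (x1, y1, x2, y2) with a causal decoder; the sketch below assumes that format and a hypothetical `decoder(tokens, memory)` interface returning next-token logits, neither of which is confirmed by the source.

```python
@torch.no_grad()
def decode_box(decoder, memory: torch.Tensor, num_bins: int = 1000) -> torch.Tensor:
    """Greedy autoregressive decoding of a box as four coordinate tokens.

    `decoder(tokens, memory)` is a hypothetical causal decoder returning
    next-token logits of shape (B, num_bins); token id 0 is assumed to be
    the start token. The 4-token (x1, y1, x2, y2) format follows common
    sequence-based trackers and is not a confirmed detail of VLDF.
    """
    b = memory.size(0)
    tokens = torch.zeros(b, 1, dtype=torch.long, device=memory.device)  # <start>
    coords = []
    for _ in range(4):                       # x1, y1, x2, y2 in turn
        logits = decoder(tokens, memory)     # (B, num_bins)
        nxt = logits.argmax(dim=-1)          # greedy next coordinate bin
        tokens = torch.cat([tokens, nxt[:, None]], dim=1)
        coords.append(nxt.float() / (num_bins - 1))  # bin index -> [0, 1]
    return torch.stack(coords, dim=-1)       # (B, 4) normalized box
```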