Abstract: Visual object tracking in satellite videos is essential for remote sensing applications but remains challenging due to small targets, background interference, and appearance changes. Existing methods typically adopt either offline paradigms with high discriminability or online paradigms with strong adaptability, yet neither alone handles the co-occurrence of appearance changes and background interference effectively. To bridge this gap, this study proposes HLSOT, a hybrid learning framework that integrates the complementary strengths of both paradigms. Specifically, a global-spatial adaptive fusion module is designed to dynamically balance the offline and online responses by estimating their global reliability and modulating spatial activations, enabling robust and context-aware fusion. A variation-aware attention module further enhances adaptability through fine-grained feature interaction, capturing both semantic and spatial appearance variations. To improve localization accuracy, particularly for deformable and nonrigid targets, a structure correction head is devised to correct anisotropic localization errors via geometric modeling. Experimental results on the SatSOT, SV248S, and OOTB datasets demonstrate that HLSOT achieves state-of-the-art performance in both robustness and accuracy, particularly under complex scenarios involving multiple simultaneous challenges, while maintaining a real-time speed of over 70 FPS. These results validate the effectiveness and efficiency of the hybrid framework for satellite video object tracking.
External IDs: dblp:journals/staeors/BaiLWLJWLH25