A Multilevel Spatiotemporal Attention Network for Tiny Vehicle Detection in Satellite Videos

Furong Shi, Tao Zhang, Zan Gao, Feifei Zhang, Xianbin Wen

Published: 01 Jan 2025, Last Modified: 16 Jan 2026, Venue: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, License: CC BY-SA 4.0
Abstract: Vehicle detection in satellite videos is crucial for large-scale traffic monitoring and urban management, yet it remains challenging because the objects are extremely small and offer limited appearance features. Existing methods generally focus on extracting spatiotemporal information at the local pixel level while neglecting the long-range dependencies between vehicle instances. This insufficient spatiotemporal feature aggregation limits detection accuracy. To address this issue, we propose a coarse-to-fine multilevel spatiotemporal attention network (MLSTA-Net) for detecting tiny vehicles in satellite videos. Specifically, a pixel-level spatiotemporal attention module is introduced, which leverages motion priors to guide the aggregation of spatiotemporal features at the pixel level, thereby enhancing the fine-grained representation of targets and generating coarse vehicle detection results. Subsequently, an instance-level spatiotemporal attention module is developed to refine the initial detections by modeling the spatial relationships of instances within a single frame and the temporal consistency of instances across multiple frames. Experiments on the VISO and SAT-MTB datasets demonstrate that MLSTA-Net outperforms state-of-the-art object detection approaches.
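To make the idea of motion-guided pixel-level aggregation concrete, the following is a toy sketch only, not the MLSTA-Net module, whose design the abstract does not specify. It assumes a crude motion prior (each pixel's absolute deviation from its temporal mean) and uses a softmax over time to weight frames before summing them. The function names `motion_prior` and `aggregate` and the `temperature` parameter are hypothetical and chosen for illustration.

```python
import math

def motion_prior(frames):
    """Assumed motion cue: per-pixel absolute deviation from the temporal mean."""
    t = len(frames)
    h, w = len(frames[0]), len(frames[0][0])
    mean = [[sum(f[y][x] for f in frames) / t for x in range(w)] for y in range(h)]
    return [[[abs(f[y][x] - mean[y][x]) for x in range(w)] for y in range(h)]
            for f in frames]

def aggregate(frames, temperature=1.0):
    """Weight frames at each pixel by a softmax over time of the motion prior,
    then return the weighted sum (a stand-in for attention-based aggregation)."""
    prior = motion_prior(frames)
    t = len(frames)
    h, w = len(frames[0]), len(frames[0][0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            scores = [prior[k][y][x] / temperature for k in range(t)]
            m = max(scores)  # subtract the max for numerical stability
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            out[y][x] = sum(e / z * frames[k][y][x] for k, e in enumerate(exps))
    return out

# Three 1x2 grayscale frames: the left pixel is static, the right one changes.
frames = [[[0.2, 0.0]], [[0.2, 0.0]], [[0.2, 1.0]]]
result = aggregate(frames)
```

In this sketch a static pixel keeps its value, because all frames receive equal weight, while a pixel that changes in one frame is pulled toward that frame's value, above the plain temporal mean. A learned attention module would replace both the hand-crafted prior and the fixed softmax weighting.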