Patchwise Temporal-Spatial Feature Aggregation Network for Object Detection in Satellite Video

Published: 01 Jan 2024, Last Modified: 07 Nov 2024. Venue: IEEE Geosci. Remote. Sens. Lett. 2024. License: CC BY-SA 4.0
Abstract: In this letter, we propose a patchwise temporal-spatial feature aggregation (PTFA) network for object detection in satellite video. First, the feature extractor processes the key frame (KF) along with its support frames to ensure comprehensive spatial coverage of potential objects. Subsequently, we model the semantic similarities among instance-level proposals to explore robust interactions between the temporally adjacent support frames and the KF. Furthermore, because objects in satellite video are extremely small, we crop the input frames into patches according to a fixed criterion. Temporal-spatial feature aggregation (TSFA) operations are then performed on instance-level RoI features, which yields more nuanced and comprehensive descriptors from the explicit high-resolution temporal-spatial features. The patch features are reassembled into the original frame layout to supply more valid feature responses. Finally, we compare our PTFA network with many recent works on the SAT-MTB dataset. Extensive experiments demonstrate that our method achieves state-of-the-art performance compared with various static-image and video object detection (VID) approaches.
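The two ideas named in the abstract, fixed-criterion patch cropping with later reassembly and similarity-driven aggregation of instance-level features, can be illustrated with a minimal sketch. This is not the authors' PTFA or TSFA implementation: the non-overlapping grid, the cosine similarity, the softmax weighting, and every function name below are assumptions chosen only for illustration, since the abstract does not specify them.

```python
import math


def crop_patches(frame, patch_h, patch_w):
    """Split a 2D frame (list of rows) into non-overlapping patches.

    Assumption: the "fixed criterion" is a regular grid and the frame
    dimensions are multiples of the patch size.
    """
    h, w = len(frame), len(frame[0])
    patches = {}
    for top in range(0, h, patch_h):
        for left in range(0, w, patch_w):
            patches[(top, left)] = [
                row[left:left + patch_w] for row in frame[top:top + patch_h]
            ]
    return patches


def reconstruct(patches, h, w):
    """Write each patch back at its original offset to rebuild the frame."""
    out = [[None] * w for _ in range(h)]
    for (top, left), patch in patches.items():
        for i, row in enumerate(patch):
            out[top + i][left:left + len(row)] = row
    return out


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def aggregate(key_feat, support_feats):
    """Similarity-weighted average of a key-frame feature and support features.

    Assumption: weights are a softmax over cosine similarities to the
    key-frame feature; the key feature itself is one of the candidates.
    """
    candidates = [key_feat] + list(support_feats)
    sims = [cosine(key_feat, f) for f in candidates]
    peak = max(sims)
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [
        sum(wt * f[d] for wt, f in zip(weights, candidates))
        for d in range(len(key_feat))
    ]
```

In this sketch, `crop_patches` followed by `reconstruct` returns the original frame unchanged, and `aggregate` pulls the key-frame feature toward the support features that resemble it most, which is the general behavior the abstract attributes to proposal-level semantic similarity.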