LSTT: LONG SHORT-TERM TRANSFORMER FOR VIDEO SMALL OBJECT DETECTION

13 Sept 2024 (modified: 13 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Long Short-Term; Transformer; Small Object Detection; Video Object Detection
TL;DR: This paper proposes a novel end-to-end Long Short-Term Transformer network for small object detection in videos.
Abstract: Detecting small objects in video sequences is crucial, yet it poses significant challenges due to their limited visibility and dynamic nature, which complicate accurate identification and localization. Traditional methods often employ a uniform aggregation strategy across all frames, neglecting the unique spatiotemporal relationships of small objects, which results in insufficient feature extraction and diminished detection performance. This paper introduces a long short-term transformer network specifically designed for small object detection in videos. The model integrates features from both long-term and short-term frames: long-term frames capture global contextual information, enhancing the model's ability to represent background scenes, while short-term frames provide dynamic information closely related to the current detection frame, thereby improving the feature representation of small objects. A dynamic query generation module optimizes query generation based on the implicit motion relationships of targets in short-term frames, adapting the queries to the current detection frame. Additionally, the network employs a progressive sampling strategy, densely sampling short-term frames and sparsely sampling long-term frames, to model video scenes effectively. A spatio-temporal alignment encoder further enhances pixel-level features by accounting for temporal and spatial transformations. Extensive experiments on the VisDrone-VID and UAVDT datasets demonstrate the method's effectiveness, with average detection precision increases of 1.4% and 2.1%, respectively, highlighting its potential for small object detection in videos.
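
To make the progressive sampling strategy concrete, below is a minimal sketch of one plausible implementation: short-term support frames are drawn densely from the immediate neighborhood of the detection frame, while long-term frames are drawn sparsely at a fixed stride over the wider history. The function name, parameters, and default values are all illustrative assumptions; the submission does not disclose its exact sampling schedule.

```python
def sample_support_frames(cur_idx, n_short=3, n_long=4, long_stride=10):
    """Progressive sampling: dense short-term plus sparse long-term frames.

    Hypothetical sketch; n_short, n_long, and long_stride are assumed
    hyperparameters, not values taken from the paper.
    """
    # Dense short-term window: frames immediately preceding the current frame,
    # carrying motion cues closely tied to the detection frame.
    short_term = sorted({max(cur_idx - i, 0) for i in range(1, n_short + 1)})
    # Sparse long-term frames: strided samples over a wider temporal range,
    # supplying global scene and background context.
    long_term = sorted({max(cur_idx - i * long_stride, 0) for i in range(1, n_long + 1)})
    return short_term, long_term


# Example: for detection frame 50, this yields short-term frames [47, 48, 49]
# and long-term frames [10, 20, 30, 40].
print(sample_support_frames(50))
```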
Primary Area: learning on time series and dynamical systems
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 508