Drone-HAT: Hybrid Attention Transformer for Complex Action Recognition in Drone Surveillance Videos

Published: 01 Jan 2024, Last Modified: 25 Jan 2025, CVPR Workshops 2024, CC BY-SA 4.0
Abstract: Ultra-high-resolution aerial videos are increasingly used to enhance surveillance in sparsely populated areas. Realizing their surveillance potential, however, requires automatically analyzing human activities in these videos, i.e., answering "who is doing what?". Atomic visual action detection has successfully recognized such activities in movie data, but adapting it to ultra-high-resolution aerial videos is challenging because the target persons appear tiny from overhead views and are sparsely distributed. Moreover, existing atomic visual action detection methods assume single-label actions, whereas people can perform multiple actions simultaneously, making a multi-label formulation more appropriate. To address these problems, we propose a multi-label action detection/recognition framework built on a hybrid attention vision transformer (HAT) that recognizes recurrent actions more efficiently. In addition, a multi-scale, multi-granularity module inside the action recognition transformer block extracts relevant features without redundancy. On the Okutama Dataset, we demonstrate that our method outperforms existing state-of-the-art approaches for interpreting human activity in aerial videos.
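The multi-label formulation argued for in the abstract can be illustrated with a minimal sketch. This is not the authors' code; the feature dimension, class count, and threshold are hypothetical. It shows the key difference from single-label atomic action detection: a per-class sigmoid head lets one detected person be assigned several simultaneous actions, whereas a softmax head forces exactly one.

```python
# Hypothetical sketch of a multi-label action head (assumed sizes,
# not the paper's actual architecture or hyperparameters).
import torch
import torch.nn as nn

NUM_ACTIONS = 12                 # assumed number of atomic action classes
features = torch.randn(4, 256)   # 4 detected persons, 256-d person features

head = nn.Linear(256, NUM_ACTIONS)
logits = head(features)          # shape (4, NUM_ACTIONS)

# Single-label baseline: softmax picks exactly one action per person.
single_label = logits.softmax(dim=-1).argmax(dim=-1)   # shape (4,)

# Multi-label: independent sigmoid per class, thresholded separately,
# so a person can be e.g. "walking" and "carrying" at the same time.
probs = logits.sigmoid()                               # shape (4, NUM_ACTIONS)
multi_label = probs > 0.5                              # boolean mask per class

# Training uses binary cross-entropy against multi-hot targets.
targets = torch.zeros(4, NUM_ACTIONS)
targets[0, [2, 5]] = 1.0         # person 0 performs two actions at once
loss = nn.BCEWithLogitsLoss()(logits, targets)
```

The design choice here is that each action is scored independently, so co-occurring actions do not compete for probability mass the way they would under a softmax.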