Abstract: Aerial Action Recognition (AAR) in videos captured by Unmanned Aerial Vehicles (UAVs) plays a vital role in numerous applications. However, existing action recognition methods primarily cater to fixed or nearby cameras and rarely consider the motion disturbance introduced by UAVs, including their varying attitudes and positions. In aerial videos, moving objects occupy small regions relative to broad backgrounds, and the UAV's movement adds relative motion to the motion of the objects, so the semantic information available for AAR is sparser and more disturbed. To address these issues, we present a novel framework, dubbed 3D-Tok, that Selects, Expands, and Squeezes the original visual tokens to obtain compact yet diverse semantic-enhanced tokens. In particular, we present a 3D-token selector (3TS) to select complex yet diverse tokens in three channels, capturing the semantics of moving objects in comparatively small regions. Additionally, to remove the disturbed semantic information caused by UAV flight, we present an Expand-Squeeze Converter (ESC) that adaptively expands and squeezes the 3D-selected tokens under a contrastive loss constraint, thereby suppressing semantic-irrelevant information and reinforcing semantic-relevant information through interpolation-based conversion. By integrating token selection, expansion, and squeezing into an all-in-one framework, 3D-Tok shows significant improvements on the UAV-Human dataset (↑9.5%), RoCoG-v2 dataset (↑23.5%), and Drone-Action dataset (↑5.7%).