Abstract: Few-shot action recognition seeks to classify new action categories using only a few labeled video samples as reference. Due to the lack of sufficient training samples and the complex structure of video data, it is difficult to extract global features that can be directly used for classification. Therefore, most previous works adopt image encoders to extract the features of each frame individually, and then perform temporal fusion and alignment on the query and support features. However, when extracting features, these methods neglect sufficient spatiotemporal modeling and the relationships with other categories in the few-shot task, which makes them less effective at distinguishing between the given action classes, especially those that require the perception of local motion. In this paper, we present Hierarchical Task-aware Temporal Modeling and Matching (HTTMM) to better perceive critical motion patterns and extract task-relevant discriminative features for matching. Specifically, we propose a task guidance generator and a hierarchical task-aware temporal module. The former collects samples across all categories to gather task-level contextual information and generate task guidance. The latter performs hierarchical spatiotemporal modeling in a task-aware manner by incorporating the task guidance and leveraging the multi-stage visual features of the image encoder. This design makes the extracted features rich in motion cues and more discriminative among the new classes in the task. Based on the resulting global and local features, we further propose a simple matching strategy that takes multi-level relationships into consideration to improve the robustness of matching. To demonstrate the effectiveness of our method, we evaluate it on four commonly used datasets, i.e., Kinetics, UCF101, HMDB51, and Something-Something v2. The experimental results show that our method outperforms the existing state-of-the-art methods.
External IDs: dblp:journals/ijon/ZhanPWZS25