Motion Stimulation for Compositional Action Recognition

Published: 01 Jan 2023, Last Modified: 15 May 2023. IEEE Trans. Circuits Syst. Video Technol., 2023.
Abstract: Recognizing unseen combinations of actions and objects, namely (zero-shot) compositional action recognition, is extremely challenging for conventional action recognition algorithms in real-world applications. Previous methods focus on enhancing the dynamic cues of objects appearing in the scene by building region features or tracklet embeddings from ground-truth or detected bounding boxes. These methods rely heavily on manual annotation or on the quality of detectors, which makes them inflexible for practical applications. In this work, we aim to mine temporal cues from moving objects or hands without explicit supervision. To this end, we propose a novel Motion Stimulation (MS) block, specifically designed to mine dynamic cues of local regions autonomously from adjacent frames. The MS block consists of three steps: motion feature extraction, motion feature recalibration, and action-centric excitation. It can be directly and conveniently integrated into existing video backbones to enhance the compositional generalization ability of action recognition algorithms. Extensive experimental results on three action recognition datasets, Something-Else, IKEA-Assembly, and EPIC-KITCHENS, demonstrate the effectiveness and interpretability of our MS block.
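The abstract only names the three steps; their exact formulation is in the paper, not here. As a rough, hypothetical illustration of what such a pipeline could look like, the sketch below implements a plausible reading of each step on per-frame 2-D feature maps: adjacent-frame differencing for motion extraction, magnitude normalization for recalibration, and a sigmoid gate for excitation. All function names and design choices are assumptions, not the authors' implementation.

```python
import math

def motion_stimulation(frames):
    """Hypothetical sketch of the three MS steps on a list of 2-D
    frame features (lists of lists of floats). NOT the paper's code."""
    T = len(frames)
    H, W = len(frames[0]), len(frames[0][0])

    # 1) Motion feature extraction (assumed): difference of adjacent frames.
    motion = []
    for t in range(T - 1):
        motion.append([[frames[t + 1][i][j] - frames[t][i][j]
                        for j in range(W)] for i in range(H)])

    # 2) Motion feature recalibration (assumed): normalize each motion map
    #    by its mean absolute magnitude so global motion scale is removed.
    recalibrated = []
    for m in motion:
        mag = sum(abs(v) for row in m for v in row) / (H * W) or 1.0
        recalibrated.append([[v / mag for v in row] for row in m])

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # 3) Action-centric excitation (assumed): a sigmoid gate derived from
    #    motion magnitude re-weights the original frame features, so
    #    moving regions are emphasized and static regions are damped.
    excited = [frames[0]]  # first frame has no preceding motion map
    for t in range(1, T):
        gate = recalibrated[t - 1]
        excited.append([[frames[t][i][j] * sigmoid(abs(gate[i][j]))
                         for j in range(W)] for i in range(H)])
    return excited
```

Static regions receive a neutral gate of sigmoid(0) = 0.5, while regions with large inter-frame change are passed through nearly unchanged; a real backbone integration would operate on channel-wise CNN feature maps rather than raw frames.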