Weakly Semantic Guided Action Recognition

Published: 01 Jan 2019 · Last Modified: 13 Nov 2024 · IEEE Trans. Multim. 2019 · CC BY-SA 4.0
Abstract: Action recognition plays a fundamental role in computer vision and video analysis. Nevertheless, extracting effective spatial-temporal features remains a challenging task. This paper proposes three simple but effective weakly semantic guided modules (SGMs) for both environment-constrained and cross-domain action recognition. The SGMs are composed entirely of 3-D convolutions and element-wise gated operations; thus, they are efficient and easy to implement. The semantic guidance is obtained in a weakly supervised manner, in which each video clip is labeled with only an action class rather than pixel-level semantics. Benefiting from this semantic guidance, the network [called the semantic guided network (SGN)] can focus on the salient parts of the video clips. Consequently, redundant information is reduced and the model is more robust to noise. Moreover, owing to the intrinsic properties of the SGMs, SGN is fully end-to-end trainable. Extensive experiments on both environment-constrained (e.g., Penn, HMDB-51, and UCF-101) and cross-domain (e.g., ODAR) action recognition datasets demonstrate its effectiveness. Specifically, SGN improves over the baseline ResNet3D by 3.7%, 2.1%, and 5.2% on Penn, HMDB-51, and UCF-101, respectively, and SGN ranked third in the ODAR 2017 challenge.