Graph Representation for Weakly-Supervised Spatio-Temporal Action Detection

Published: 01 Jan 2023, Last Modified: 03 Oct 2024, IJCNN 2023, CC BY-SA 4.0
Abstract: Spatio-temporal action recognition and localization are crucial in several computer vision applications, including video surveillance and video captioning. However, most existing action recognition and localization approaches are designed for offline use and perform well only on trimmed action clips. They also require precise annotations at the clip, frame, and pixel levels, which is labor-intensive and thus limits their usage in real-world, large-scale scenarios. In this paper, we propose a weakly-supervised spatio-temporal action recognition and localization approach based on a graph representation of untrimmed videos. More specifically, we propose an efficient graph representation of videos that uses only clip-level annotations, whereas existing approaches follow either supervised or unsupervised learning. For graph construction, local actions are determined from the key interesting demeanors in an action clip and assigned the same class label as the clip. This weak annotation significantly impacts both action recognition and localization because the local actions exhibit considerable intra-class variability and inter-class similarity. To handle this variability and similarity, we apply a weakly-supervised deep multiple instance ranking framework to the local action descriptors. To classify a graph of local actions into one of the action classes, we use a support vector machine with a graph kernel, and we then localize the recognized action as a non-cubic-shaped portion of the video based on the local actions in the graph. Experimental results show that the proposed approach outperforms state-of-the-art methods on three benchmark datasets: THUMOS14, UCF-Sports, and JHMDB-21.
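The two learning components mentioned in the abstract can be illustrated with a minimal sketch. The specific kernel (a vertex-label histogram kernel) and loss (a hinge-based bag ranking loss) below are illustrative assumptions for exposition, not the paper's exact formulation; all names are hypothetical.

```python
# Hedged sketch: a simple graph kernel over local-action node labels, and a
# multiple-instance ranking loss over clip "bags". Both are assumptions
# standing in for the paper's unspecified kernel/loss choices.
from collections import Counter

def vertex_histogram_kernel(labels_a, labels_b):
    """A basic graph kernel: inner product of node-label histograms.
    Nodes stand for local actions; labels are their weak class tags."""
    ha, hb = Counter(labels_a), Counter(labels_b)
    return sum(ha[label] * hb[label] for label in ha)

def mil_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Hinge ranking loss over bags: the top-scoring local action in a
    positively labelled clip should outrank the top-scoring local action
    in a negative clip by at least `margin`."""
    return max(0.0, margin - max(pos_scores) + max(neg_scores))

# Toy usage: two video graphs given as lists of node labels.
g1 = ["run", "run", "jump"]
g2 = ["run", "wave"]
print(vertex_histogram_kernel(g1, g2))           # 2 (two "run" matches)
print(mil_ranking_loss([0.2, 0.9], [0.1, 0.3]))  # ≈ 0.4
```

In a full pipeline, a kernel matrix computed this way could be passed to an SVM with a precomputed kernel for graph classification, while the ranking loss would train the instance scorer from clip-level labels only.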