Exploring Spatio-Temporal Graph Convolution for Video-Based Human-Object Interaction Recognition

Published: 01 Jan 2023 · Last Modified: 13 Nov 2023 · IEEE Trans. Circuits Syst. Video Technol. 2023
Abstract: Video-based human-object interaction recognition is a challenging task, since the state of objects as well as their correlations change constantly throughout a video. Existing methods mainly rely on 3D CNNs or on separate components (e.g., GCN + RNN) that model spatial and temporal correlations independently, but they neglect to model spatio-temporal correlations jointly and to capture the long-term temporal dynamics of objects. In this paper, we propose a novel model, named Spatio-Temporal Interaction Graph Parsing Networks (STIGPN), for human-object interaction recognition in videos. STIGPN captures spatial and temporal correlations simultaneously and thus models intra-frame and inter-frame dependencies efficiently and effectively. To model the long-term temporal dynamics of objects, we introduce spatio-temporal feature enhancement, which improves the detection of salient human-object interaction pairs. We explore three types of spatio-temporal graph convolutions that jointly capture spatio-temporal correlations and assess their effectiveness as the basic building block of STIGPN. Extensive experiments on the CAD-120, Something-Else, and Charades datasets show that our proposed solution achieves competitive results compared with state-of-the-art methods. Code for STIGPN is available at: https://github.com/NingWang2049/STIGPN2
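To give a concrete sense of the building block the abstract refers to, the following is a minimal, illustrative sketch of a generic spatio-temporal graph convolution of the form H' = σ(A·H·W), where the adjacency couples nodes both within a frame (spatial, human-object edges) and across adjacent frames (temporal edges). The shapes, the adjacency split, and the normalization-free formulation are assumptions for illustration only, not the paper's exact design.

```python
import numpy as np

def st_graph_conv(H, A_spatial, A_temporal, W):
    """One generic spatio-temporal graph convolution layer (illustrative).

    H:          (T, N, C) node features for T frames, N entities, C channels.
    A_spatial:  (N, N) intra-frame adjacency (e.g., human-object edges).
    A_temporal: (N, N) inter-frame adjacency linking frame t-1 to frame t.
    W:          (C, C_out) learnable projection weights.
    Returns:    (T, N, C_out) updated node features.
    """
    T, N, C = H.shape
    out = np.zeros((T, N, W.shape[1]))
    for t in range(T):
        # Aggregate messages from neighbors within the same frame.
        msg = A_spatial @ H[t]
        # Add messages from the previous and next frames (temporal edges).
        if t > 0:
            msg = msg + A_temporal @ H[t - 1]
        if t < T - 1:
            msg = msg + A_temporal.T @ H[t + 1]
        # Linear projection followed by a ReLU nonlinearity.
        out[t] = np.maximum(msg @ W, 0.0)
    return out
```

Stacking several such layers lets information propagate across both entities and time, which is the intuition behind treating the video as one spatio-temporal graph rather than running a spatial GCN and a temporal RNN separately.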