HierGAT: hierarchical spatial-temporal network with graph and transformer for video HOI detection

Published: 01 Jan 2025, Last Modified: 16 May 2025 · Multimedia Systems 2025 · CC BY-SA 4.0
Abstract: Unlike traditional video-based HOI detection, which is confined to segment labeling, joint segmentation and labeling for video HOI requires predicting human sub-activity and object affordance labels while also delineating their segment boundaries. Previous methods rely mainly on frame-level and segment-level features to predict segmentation boundaries and labels, yet inter-frame and long-term temporal information is equally important. To address this task and delve deeper into the temporal dynamics of human–object interactions, we propose a novel Hierarchical spatial-temporal network with Graph And Transformer (HierGAT). The framework integrates two branches, a temporal-enhanced recurrent graph network (TRGN) and parallel transformer encoders (PTE), which extract hierarchical temporal features from video. We first augment the temporal aspect of the recurrent graph network by incorporating inter-frame interactions, capturing spatial-temporal information both within and across frames. Because adjacent frames play an auxiliary role, we further propose a grouped fusion mechanism to fuse the resulting interaction information. The PTE branch consists of two parallel transformer encoders that extract spatial and long-term temporal information from the video. By combining the outputs of both branches, our model fully exploits spatial-temporal information to predict segmentation boundaries and labels. Experimental results on three datasets demonstrate the effectiveness of our approach. All code and data can be found at https://github.com/wjx1198/HierGAT.
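As a rough illustration of the two-branch design described in the abstract, the PyTorch sketch below shows how TRGN-style recurrent features and PTE-style transformer features over per-frame inputs might be fused to produce per-frame label and boundary predictions. All module choices, dimensions, and the concatenation-based fusion are illustrative assumptions, not the authors' implementation (which is in the linked repository): the GRU stands in for the temporal-enhanced recurrent graph network, and plain concatenation stands in for the paper's grouped fusion mechanism.

```python
# Hypothetical sketch of a HierGAT-like two-branch layout (not the authors' code).
import torch
import torch.nn as nn

class HierGATSketch(nn.Module):
    def __init__(self, feat_dim=512, hidden=256,
                 num_subactivities=10, num_affordances=12):
        super().__init__()
        # TRGN branch (stand-in): bidirectional GRU over per-frame,
        # graph-pooled features, approximating the recurrent graph
        # network with inter-frame interactions.
        self.trgn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        # PTE branch: two parallel transformer encoders, one for spatial
        # and one for long-term temporal information.
        # (nn.TransformerEncoder deep-copies the layer, so both encoders
        # get independent weights.)
        enc_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                               batch_first=True)
        self.spatial_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.temporal_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        fused = 2 * hidden + 2 * feat_dim
        # Per-frame heads for sub-activity labels, affordance labels,
        # and segment-boundary scores.
        self.subact_head = nn.Linear(fused, num_subactivities)
        self.afford_head = nn.Linear(fused, num_affordances)
        self.boundary_head = nn.Linear(fused, 1)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim) pooled per-frame features.
        trgn_out, _ = self.trgn(frame_feats)      # (B, T, 2*hidden)
        spat = self.spatial_enc(frame_feats)      # (B, T, feat_dim)
        temp = self.temporal_enc(frame_feats)     # (B, T, feat_dim)
        # Concatenation here replaces the paper's grouped fusion mechanism.
        fused = torch.cat([trgn_out, spat, temp], dim=-1)
        return (self.subact_head(fused),                 # sub-activity logits
                self.afford_head(fused),                 # affordance logits
                self.boundary_head(fused).squeeze(-1))   # boundary scores

if __name__ == "__main__":
    model = HierGATSketch()
    sub, aff, bnd = model(torch.randn(2, 100, 512))
    print(sub.shape, aff.shape, bnd.shape)  # (2, 100, 10) (2, 100, 12) (2, 100)
```

Boundary scores and label logits are predicted jointly per frame, mirroring the abstract's claim that both branches' outputs are exploited for segmentation boundaries and labels.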