Abstract: Group activity recognition, which aims to simultaneously understand individual actions and group activities in video clips, plays a fundamental role in computer vision and video analysis. In this paper, we propose a novel relational inference framework, termed MLST-Former, for individual action and group activity recognition; it adaptively and jointly captures spatial-temporal dynamic interactions of varying degrees among actors and generates reasonable group representations. Specifically, we first design a multi-level spatial-temporal Transformer that captures diverse spatial-temporal contextual information of actors to handle the unbalanced interactions between them. Furthermore, the proposed network fully mines long-range spatial-temporal dependencies by virtue of a merge function and a cross-attention mechanism. We then propose an inter-frame gating fusion mechanism (IGFM) to selectively aggregate the temporal and structural features of interacting actors. A new multi-task learning strategy, combining the classification losses of individual actions and group activities, is also developed. Moreover, we adopt a motion trajectory branch to provide complementary dynamic features that further improve recognition performance. A series of ablation studies demonstrates the effectiveness and respective contributions of the different components of the proposed method. Extensive experiments on four public GAR datasets show that our approach achieves highly competitive performance compared with state-of-the-art methods.
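For concreteness, the sketch below illustrates how a gating-based fusion of temporal and structural actor features and a joint individual/group classification objective might be realized. All names, tensor shapes, and the loss weight `lam` are illustrative assumptions; the abstract does not specify the exact formulation used in MLST-Former.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InterFrameGatingFusion(nn.Module):
    """Hypothetical sketch of an inter-frame gating fusion mechanism (IGFM).

    A learned sigmoid gate decides, per actor and per channel, how much of
    the temporal feature versus the structural (spatial-relational) feature
    to retain; the paper's actual design may differ.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, temporal: torch.Tensor, structural: torch.Tensor) -> torch.Tensor:
        # temporal, structural: (batch, num_actors, dim)
        g = torch.sigmoid(self.gate(torch.cat([temporal, structural], dim=-1)))
        return g * temporal + (1.0 - g) * structural


def multitask_loss(action_logits, activity_logits, action_labels, activity_labels, lam=1.0):
    """Joint objective: group-activity loss plus a weighted individual-action
    loss; the weighting factor lam is an assumption for illustration."""
    # action_logits: (batch * num_actors, num_actions)
    # activity_logits: (batch, num_activities)
    l_action = F.cross_entropy(action_logits, action_labels)
    l_activity = F.cross_entropy(activity_logits, activity_labels)
    return l_activity + lam * l_action


# Example usage with random features for 2 clips of 12 actors each.
fusion = InterFrameGatingFusion(dim=256)
fused = fusion(torch.randn(2, 12, 256), torch.randn(2, 12, 256))  # (2, 12, 256)
```

The per-channel gate lets the model lean on temporal cues for actors with strong motion and on structural (relational) cues for relatively static actors, which is one plausible way to handle the unbalanced interactions the abstract describes.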