Abstract: In recent years, graph convolutional networks (GCNs) have been widely used in skeleton-based action recognition to pursue higher accuracy. Traditional approaches typically integrate different modalities with uniform fusion weights, which leads to inadequate cross-modal information fusion and sacrifices flexibility and robustness. In this paper, we explore the potential of adaptively fusing different modalities and propose a new fusion algorithm, coined Multi-modal Stream Fusion GCN (MSF-GCN). The proposed algorithm consists of three branches: JS-GCN, BS-GCN, and MS-GCN, corresponding to joint, bone, and motion modeling, respectively. In our design, the motion patterns of the joint and bone modalities are dynamically fused by an MLP layer. After motion modeling, the static joint and bone branches join the motion branch in a final fusion that produces the category predictions. MSF-GCN thus performs static and dynamic fusion simultaneously, which greatly increases the degree of interaction among the modalities and improves flexibility. The proposed fusion strategy is applicable to different backbones and boosts performance with only a marginal increase in computation. Extensive experiments on the widely used NTU-RGB+D dataset demonstrate that our model achieves results that are better than or comparable to current solutions, reflecting the merit of our fusion strategy.
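The abstract alone does not pin down the implementation, but the three-branch design can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration: the `make_backbone` factory, the tensor layout `(N, C, T, V)`, the feature dimension, and the choice of a pointwise 1x1 convolution as the fusing MLP are ours, not the authors'.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MSFGCNSketch(nn.Module):
    """Illustrative sketch of the three-branch fusion described in the
    abstract. Backbone choice, shapes, and the exact fusion point are
    hypothetical, not the authors' implementation."""

    def __init__(self, make_backbone, in_channels=3, feat_dim=256,
                 num_classes=60):
        super().__init__()
        # One GCN backbone per stream (JS-GCN, BS-GCN, MS-GCN); each is
        # assumed to map a (N, C, T, V) skeleton tensor to (N, feat_dim).
        self.js_gcn = make_backbone()
        self.bs_gcn = make_backbone()
        self.ms_gcn = make_backbone()
        # Pointwise MLP (1x1 conv) that adaptively fuses joint- and
        # bone-derived motion before motion modeling.
        self.motion_fuse = nn.Sequential(
            nn.Conv2d(2 * in_channels, in_channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Linear(3 * feat_dim, num_classes)

    def forward(self, joints, bones):
        # joints, bones: (N, C, T, V) coordinate / bone-vector tensors.
        # Motion = frame-to-frame differences, zero-padded to keep T.
        jm = F.pad(torch.diff(joints, dim=2), (0, 0, 1, 0))
        bm = F.pad(torch.diff(bones, dim=2), (0, 0, 1, 0))
        # Dynamic fusion of the two motion modalities, then motion modeling.
        motion = self.motion_fuse(torch.cat([jm, bm], dim=1))
        f_motion = self.ms_gcn(motion)
        # The static joint and bone branches join the final fusion.
        f_joint = self.js_gcn(joints)
        f_bone = self.bs_gcn(bones)
        return self.classifier(torch.cat([f_joint, f_bone, f_motion], dim=1))
```

Because the fusion module and classifier are agnostic to the internals of `make_backbone`, this reading is consistent with the abstract's claim that the strategy can be attached to different GCN backbones.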