Abstract: Continuous Sign Language Recognition (CSLR) aims to interpret meaning from signers’ postures and movements. Joint-wise correspondences between estimated skeleton data and sign videos provide complementary insights into appearance and motion. In this paper, we propose a Skeleton-aware SlowFast Network (S\(^2\)Net) to effectively capture the appearance and motion information in sign videos. S\(^2\)Net processes skeleton data in the fast pathway and video data in the slow pathway, progressively integrating the two streams. First, we project both skeleton and video data into a unified graph-structured space and employ a consistent GCN-based architecture for both pathways. We then propose a group-wise cross-attention module to fuse intermediate features between the two pathways. Finally, a frame-wise fusion pathway integrates the semantic information at the sequence level. Experimental results on three public datasets demonstrate the effectiveness and efficiency of the proposed method.
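The abstract names a group-wise cross-attention module but does not specify its internals, so the following is only a minimal NumPy sketch of one plausible reading: channels are split into groups, and within each group the features of one pathway attend over the features of the other. The function name `group_cross_attention`, the choice to group along the channel axis, the omission of learned projections, the reuse of keys as values, and the residual connection are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def group_cross_attention(query_feats, context_feats, num_groups):
    """Hypothetical group-wise cross-attention (illustrative only).

    query_feats:   (Nq, C) features from one pathway (e.g. slow/video).
    context_feats: (Nk, C) features from the other pathway (e.g. fast/skeleton).
    Returns the fused features (Nq, C) and attention weights (G, Nq, Nk).
    """
    nq, c = query_feats.shape
    nk, c_ctx = context_feats.shape
    if c != c_ctx:
        raise ValueError("both pathways must share the channel dimension")
    if c % num_groups != 0:
        raise ValueError("channels must be divisible by num_groups")
    d = c // num_groups

    # Split channels into groups: (N, C) -> (G, N, d).
    q = query_feats.reshape(nq, num_groups, d).transpose(1, 0, 2)
    k = context_feats.reshape(nk, num_groups, d).transpose(1, 0, 2)

    # Scaled dot-product attention computed independently per group.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (G, Nq, Nk)
    weights = softmax(scores, axis=-1)
    attended = weights @ k                            # keys reused as values

    # Merge groups back to (Nq, C) and add a residual connection (assumed).
    fused = attended.transpose(1, 0, 2).reshape(nq, c)
    return query_feats + fused, weights


# Toy example: 8 slow-pathway frames attend over 32 fast-pathway frames.
rng = np.random.default_rng(0)
video = rng.standard_normal((8, 16))
skeleton = rng.standard_normal((32, 16))
fused, weights = group_cross_attention(video, skeleton, num_groups=4)
```

Because attention is computed over the context sequence, the two pathways may have different temporal lengths, as in a SlowFast-style design; the output keeps the shape of the querying pathway.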
External IDs: dblp:conf/accv/YangMC22