Abstract: In the field of the Internet of Medical Things (IoMT), the demand for human action recognition (HAR) is growing. Because traditional sensors are limited in portability and privacy, considerable progress has been made in 3-D skeleton-based action recognition. However, existing methods ignore potential higher-order semantic information between joints and fail to perceive rapidly changing dynamic details, so actions with similar motion trajectories are frequently confused. To alleviate this issue, we propose a novel learning confusion representation network (LCR-Net). Specifically, a progressive feature enhancement module is first designed, which uses self-attention to gradually aggregate lower-order features into higher-order features, emphasizing the relative movement between body parts. Second, we design an enhanced spatio-temporal convolution that explores the potential spatio-temporal dependencies between joints by adding a mask matrix and an attention fusion mechanism. To further perceive the spatio-temporal relationships underlying subtle changes, we construct interactive sparse-dense pathways at different spatio-temporal resolutions and enhance the complementary information between the two pathways through feature interaction. Finally, a frequency excitation learning module is proposed to efficiently learn the importance of different frequencies through cross-channel modeling, promoting the compactness of actions within classes and the separability of confusable actions. In addition, the lightweight LCR-Net${}^{\textbf {+}}$ is obtained through model compression to meet the deployment requirements of IoT systems. Comprehensive experiments on three public datasets (NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA) demonstrate the superior performance of our model.
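To make the frequency excitation idea concrete, the following is a minimal sketch, not the paper's exact formulation: it moves per-channel motion features into the frequency domain, models cross-channel importance with a small bottleneck (the function name, weight shapes, and pooling choice here are illustrative assumptions), and reweights each channel's spectrum before transforming back.

```python
import numpy as np

def frequency_excitation(x, w1, w2):
    """Hedged sketch of frequency excitation learning.

    x:  (C, T) features -- C channels over T frames.
    w1: (R, C) bottleneck weights (R < C), w2: (C, R) expansion weights.
    All names/shapes are illustrative assumptions, not the paper's API.
    """
    X = np.fft.rfft(x, axis=1)            # (C, F) complex spectrum per channel
    s = np.abs(X).mean(axis=1)            # (C,) squeeze: per-channel frequency energy
    h = np.maximum(w1 @ s, 0.0)           # cross-channel mixing + ReLU bottleneck
    a = 1.0 / (1.0 + np.exp(-(w2 @ h)))   # sigmoid excitation gate per channel
    Xw = X * a[:, None]                   # emphasize or suppress each channel's spectrum
    return np.fft.irfft(Xw, n=x.shape[1], axis=1)  # back to the time domain, (C, T)

# Toy usage: 4 channels, 8 frames, small random bottleneck weights.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w1 = rng.standard_normal((2, 4)) * 0.1
w2 = rng.standard_normal((4, 2)) * 0.1
y = frequency_excitation(x, w1, w2)
print(y.shape)  # (4, 8)
```

The gating here mirrors squeeze-and-excitation-style channel attention, applied to frequency magnitudes rather than raw activations, which is one plausible way cross-channel modeling of frequency importance can be realized.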