Abstract: Human action recognition has been explored in healthcare, sports, and entertainment, with a recent shift toward manufacturing settings for monitoring assembly tasks. Identifying assembly actions is crucial for improving human-robot collaboration and optimizing the assembly process. However, the complexity of assembly tasks poses challenges for action recognition, and single-modality methods struggle to capture their dynamics and context. We propose two multimodal methods, ConvLSTM-AssNet and C3D-AssNet, which use RGB, RGB-A, and depth data. The models are tested in single-, double-, and triple-stream configurations, with attention mechanisms integrated to focus on relevant features, and are evaluated on the HA4M dataset. Attention-Guided C3D-AssNet is the most accurate in the single-stream (RGB-A: 97.10%) and double-stream (RGB-A + Depth: 98.84%) configurations, while ConvLSTM-AssNet performs best in the triple-stream configuration (RGB + RGB-A + Depth: 97.30%). This research advances multimodal assembly action recognition for manufacturing applications.