Evaluating an Output-Fusion Technique in Multimodal Human Activity Recognition: Impact of Modality Reduction on Performance Efficiency
Abstract: Human Activity Recognition (HAR) using deep learning has advanced significantly, particularly through multimodal approaches that integrate diverse data sources to improve recognition accuracy. This study introduces a novel multimodal architecture that fuses skeletal and inertial data for improved HAR performance. The proposed method employs a dedicated Convolutional Neural Network (CNN) for each modality and then fuses their outputs at a late stage to capture spatiotemporal patterns more effectively. We evaluate the robustness and generalizability of the approach on two multimodal datasets, UTD-MHAD and CZU-MHAD, across 11 distinct experimental trials. The model outperforms existing state-of-the-art methods, achieving 99.42% accuracy on CZU-MHAD and 99.22% on UTD-MHAD, demonstrating the efficacy of the proposed multimodal fusion strategy. These results underscore the potential of combining skeletal and inertial data in deep learning frameworks to achieve high HAR accuracy, particularly in scenarios such as remote patient monitoring and elderly care.
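To make the fusion step concrete, the sketch below shows one plausible reading of the described architecture in PyTorch: a dedicated CNN per modality whose class scores are combined at the output (decision-level, or "late", fusion). The layer sizes, input shapes, channel counts, class count, and the score-averaging rule are illustrative assumptions, not the paper's exact design.

```python
# A minimal sketch of late (output-level) fusion for skeletal + inertial HAR.
# All hyperparameters below are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn

class ModalityCNN(nn.Module):
    """A small 1D CNN mapping one modality's time series to class logits."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the temporal axis
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        h = self.features(x).squeeze(-1)  # (batch, 64)
        return self.classifier(h)         # (batch, num_classes)

class LateFusionHAR(nn.Module):
    """Runs one CNN per modality and fuses their class scores at the output."""
    def __init__(self, skel_channels: int, imu_channels: int, num_classes: int):
        super().__init__()
        self.skeleton_net = ModalityCNN(skel_channels, num_classes)
        self.inertial_net = ModalityCNN(imu_channels, num_classes)

    def forward(self, skeleton: torch.Tensor, inertial: torch.Tensor) -> torch.Tensor:
        # Late fusion: average the per-modality softmax scores.
        p_skel = self.skeleton_net(skeleton).softmax(dim=-1)
        p_imu = self.inertial_net(inertial).softmax(dim=-1)
        return (p_skel + p_imu) / 2  # fused class probabilities

# Example shapes (assumed): 60 skeleton channels (20 joints x 3 coords),
# 6 inertial channels (3-axis accelerometer + gyroscope), 27 action classes.
model = LateFusionHAR(skel_channels=60, imu_channels=6, num_classes=27)
skel = torch.randn(4, 60, 100)  # (batch, channels, frames)
imu = torch.randn(4, 6, 180)    # modalities may have different lengths
probs = model(skel, imu)        # (4, 27) fused class probabilities
```

Because each branch is trained and pooled independently, this style of fusion also makes it straightforward to drop a modality at inference time and measure the resulting accuracy cost, which is the modality-reduction question the title raises.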