Interpretable Human Action Recognition: A CNN-GRU Approach with Gradient-weighted Class Activation Mapping Insights
Submission Track: Track 1: Machine Learning Research by Muslim Authors
Keywords: Human Action Recognition (HAR), CNN-GRU Architecture, Interpretability, Grad-CAM, Spatio-Temporal Modelling
TL;DR: An interpretable and accurate CNN-GRU-based framework with Grad-CAM for real-time human action recognition.
Abstract: Human Action Recognition (HAR) is essential in applications like healthcare, surveillance, and smart environments, where reliable and interpretable decision-making is critical. While Convolutional Neural Networks (CNNs) and Gated Recurrent Units (GRUs) effectively model spatial and temporal patterns, their black-box nature limits transparency in safety-sensitive domains. This study introduces an interpretable HAR framework combining a CNN-GRU architecture with Gradient-weighted Class Activation Mapping (Grad-CAM). The CNN captures frame-wise spatial features, the GRU models temporal dynamics, and a 3D convolution bridges the spatial and temporal abstractions. Grad-CAM provides frame-level heatmaps that visualize the model's rationale. Evaluated on 10 diverse classes from the UCF101 dataset, our model achieved 96.50% accuracy and outperformed several standard deep models in precision, recall, and F1 score. Visual analysis of correct and incorrect cases confirms both the reliability and the interpretability of the model. The framework offers a robust and transparent solution for real-time HAR in critical domains.
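As a concrete illustration of the pipeline described in the abstract, below is a minimal PyTorch sketch of a CNN-GRU model of this kind. The backbone choice (ResNet-18), all layer sizes, and the exact placement of the 3D-convolution bridge are assumptions made for illustration; the abstract does not specify the paper's reported configuration.

```python
# Minimal sketch of a CNN-GRU action-recognition model, assuming a
# ResNet-18 backbone and a 3D convolution as the spatio-temporal bridge.
# Layer sizes are illustrative, not the paper's reported configuration.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CNNGRU(nn.Module):
    def __init__(self, num_classes=10, hidden_size=256):
        super().__init__()
        backbone = resnet18(weights=None)
        # Frame-wise spatial feature extractor: everything up to the
        # global pooling layer (output: 512 feature channels).
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        # Assumed bridge: a 3D convolution over stacked per-frame feature
        # maps mixes short-range spatio-temporal context.
        self.conv3d = nn.Conv3d(512, 512, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))
        # GRU models temporal dynamics over the per-frame feature vectors.
        self.gru = nn.GRU(512, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):
        # clips: (batch, time, channels, height, width)
        b, t, c, h, w = clips.shape
        feats = self.cnn(clips.view(b * t, c, h, w))          # (b*t, 512, h', w')
        feats = feats.view(b, t, 512, feats.size(-2), feats.size(-1))
        feats = feats.permute(0, 2, 1, 3, 4)                   # (b, 512, t, h', w')
        feats = self.pool(self.conv3d(feats))                  # (b, 512, t, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1).permute(0, 2, 1).contiguous()
        out, _ = self.gru(feats)                               # (b, t, hidden)
        return self.fc(out[:, -1])                             # logits, last step

# Smoke test: two 16-frame RGB clips at 112x112.
logits = CNNGRU()(torch.randn(2, 16, 3, 112, 112))
print(logits.shape)  # torch.Size([2, 10])
```

For the Grad-CAM step, frame-level heatmaps could be obtained by weighting the activations of a late convolutional stage (e.g., the ResNet-18 layer4 output in this sketch) by the gradients of the predicted class score; the paper's actual target layer is not stated in the abstract.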
Submission Number: 21